E-528-529, Sector-7, Dwarka, New Delhi-110075
(Nr. Ramphal chowk and Sector 9 metro station) Ph. 011-47350606, (M) 7838010301-04 www.eduproz.in
Educate Anytime...Anywhere...
"Greetings For The Day" About Eduproz
We, at EduProz, started our voyage with a dream of making higher education available for everyone. Since
its inception, EduProz has been working as a stepping-stone for the students coming from varied
backgrounds. The best part is that the classroom sessions for distance learning or correspondence courses in both
management (MBA and BBA) and Information Technology (MCA and BCA) streams are free of cost.
Experienced faculty-members, a state-of-the-art infrastructure and a congenial environment for learning -
are a few of the things that we offer to our students. Our panel of industrial experts, drawn from various
industrial domains, leads students not only to secure good marks in examinations, but also to get an edge over
others in their professional lives. Our study materials are sufficient to keep students abreast of the present
nuances of the industry. In addition, we give importance to regular tests and sessions to evaluate our
students’ progress.
Students can attend regular classes of distance learning MBA, BBA, MCA and BCA courses at EduProz
without paying anything extra. Our centrally air-conditioned classrooms, well-maintained library and well-
equipped laboratory facilities provide a comfortable environment for learning.
Honing specific skills is essential to succeed in an interview. Keeping this in mind, EduProz has a career
counselling and career development cell where we help students prepare for interviews. Our dedicated
placement cell has been helping students land their dream jobs on completion of the course.
EduProz is strategically located in Dwarka, West Delhi (walking distance from Dwarka Sector 9 Metro
Station and a four-minute drive from the national highway); students can easily come to our centre from
anywhere in Delhi and neighbouring Gurgaon, Haryana, and avail of a quality-oriented education facility at
no extra cost.
Why Choose Edu Proz for distance learning?
• Edu Proz provides classroom facilities free of cost.
• At EduProz, classroom teaching is conducted by experienced faculty.
• Classrooms are spacious and fully air-conditioned, ensuring a comfortable ambience.
• Course fees are not overly expensive.
• Placement assistance and student counselling facilities are available.
• Unlike several other distance-learning providers, Edu Proz strives to help and motivate pupils to get
high grades, thus ensuring that they are well placed in life.
• Students are groomed and prepared to face interview boards.
• Mock tests, unit tests and examinations are held to evaluate progress.
• Special care is taken with personality development.
"HAVE A GOOD DAY"
Karnataka State Open University
(KSOU) was established on 1st June 1996 with the assent of H.E. the Governor of Karnataka as a full-fledged University in the academic year 1996, vide Government notification No. EDI/UOV dated 12th February 1996 (Karnataka State Open University Act, 1992). The Act was promulgated with the object of incorporating an Open University at the State level for the introduction and promotion of Open University and Distance Education systems in the education pattern of the State and the country, and for the co-ordination and determination of the standards of such systems. Keeping in view the educational needs of our country in general, and the State in particular, the policies and programmes have been geared to cater to the needy.

Karnataka State Open University is a UGC-recognised University of the Distance Education Council (DEC), New Delhi, a regular member of the Association of Indian Universities (AIU), Delhi, a permanent member of the Association of Commonwealth Universities (ACU), London, UK, and of the Asian Association of Open Universities (AAOU), Beijing, China, and also has an association with the Commonwealth of Learning (COL).

Karnataka State Open University is situated at the north-western end of the Manasagangotri campus, Mysore. The campus, which is about 5 km from the city centre, has a serene atmosphere ideally suited for academic pursuits. The University at present houses the Administrative Office, Academic Block, Lecture Halls, a well-equipped Library, Guest House Cottages, a modest Canteen, a Girls' Hostel and a few cottages providing limited accommodation to students coming to Mysore for attending the Contact Programmes or Term-end examinations.
Unit 1: Overview of Operating Systems:
This unit covers the introduction to and evolution of operating systems. It also covers OS components and
their services.
Introduction to Operating Systems
Programs, Code files, Processes and Threads
• A sequence of instructions telling the computer what to do is called a program.
The user normally uses a text editor to write their program in a high level
language, such as Pascal, C, Java, etc. Alternatively, they may write it in
assembly language. Assembly language is a computer language whose statements
have an almost one-to-one correspondence to the instructions understood by the
CPU of the computer. It provides a way of specifying in precise detail what
machine code the assembler should create.
A compiler is used to translate a high level language program into assembly
language or machine code, and an assembler is used to translate an assembly
language program into machine code. A linker is used to combine relocatable
object files (code files corresponding to incomplete portions of a program) into
executable code files (complete code files, for which the addresses have been
resolved for all global functions and variables).
The text for a program written in a high level language or assembly language is
normally saved in a source file on disk. Machine code for a program is normally
saved in a code file on disk. The machine code is loaded into the virtual memory
for a process, when the process attempts to execute the program.
The notion of a program is becoming more complex nowadays, because of
shared libraries. In the old days, the user code for a process was all in one file.
However, with GUI libraries becoming so large, this is no longer possible.
Library code is now stored in memory that is shared by all processes that use it.
Perhaps it is best to use the term program for the machine code stored in or
derived from a single code file.
Code files contain more than just machine code. On UNIX, a code file starts with
a header, containing information on the position and size of the code (“text”),
initialised data, and uninitialised data segments of the code file. The header also
contains other information, such as the initial value to give the program counter
(the “entry point”) and global pointer register. The data for the code and
initialised data segments then follows.
As well as the above information, code files can contain a symbol table – a table
indicating the names of all functions and global variables, and the virtual
addresses they correspond to. The symbol table is used by the linker, when it
combines several relocatable object files into a single executable code file, to
resolve references to functions in shared libraries. The symbol table is also used
for debugging. The structure of UNIX code files on the Alpha is very complex,
due to the use of shared libraries.
• When a user types in the name of a command in the UNIX shell, this results in the
creation of what is called a process. On any large computer, especially one with
more than one person using it at the same time, there are normally many
processes executing at any given time. Under UNIX, every time a user types in a
command, they create a separate process. If several users execute the same
command, then each one creates a different process. The Macintosh is a little
different from UNIX. If the user double clicks on several data files for an
application, only one process is created, and this process manages all the data
files.
A process consists of the virtual memory, information on open files, and other
operating system resources shared by its threads of execution, all of which execute
in the same virtual memory.
The threads in a process execute not only the code from a user program. They can
also execute the shared library code, operating system kernel code, and (on the
Alpha) what is called PALcode.
A process is created to execute a command. The code file for the command is
used to initialise the virtual memory containing the user code and global
variables. The user stack for the initial thread is cleared, and the parameters to the
command are passed as parameters to the main function of the program. Files are
opened corresponding to the standard input and output (keyboard and screen,
unless file redirection is used).
When a process is created, it is created with a single thread of execution.
Conventional processes never have more than a single thread of execution, but
multi-threaded processes are now becoming commonplace. We often speak about
a program executing, or a process executing a program, when we really mean a
thread within the process executes the program.
In UNIX, a new process executing a new program is created by the fork() system
call (which creates an almost identical copy of an existing process, executing the
same program), followed by the exec() system call (which replaces the program
being executed by the new program).
In the Java programming language, a new process executing a new program is
created by the exec() method in the Runtime class. The Java exec() is probably
implemented as a combination of the UNIX fork() and exec() system calls.
• A thread is an instance of execution (the entity that executes). All the threads that
make up a process share access to the same user program, virtual memory, open
files, and other operating system resources. Each thread has its own program
counter, general purpose registers, and user and kernel stack. The program
counter and general purpose registers for a thread are stored in the CPU when the
thread is executing, and saved away in memory when it is not executing.
The Java programming language supports the creation of multiple threads. To
create a thread in Java, we create an object that implements the Runnable
interface (has a run() method), and use this to create a new Thread object. To
initiate the execution of the thread, we invoke the start() method of the thread,
which invokes the run() method of the Runnable object. The threads that make up
a process need to use some kind of synchronisation mechanism to avoid more
than one thread accessing shared data at the same time. In Java, synchronisation is
done by synchronized methods. The wait(), notify(), and notifyAll() methods in
the Object class are used to allow a thread to wait until the data has been updated
by another thread, and to notify other threads when the data has been altered.
In UNIX C, the pthreads library contains functions to create new threads, and to
provide the equivalent of synchronized methods, wait(), notify(), etc. The Java
mechanism is in fact based on the pthreads library. In Java, synchronisation is
built into the design of the language (the compiler knows about synchronised
methods). In C, there is no syntax to specify that a function (method) is
synchronised, and the programmer has to explicitly put in code at the start and
end of the method to gain and relinquish exclusive access to a data structure.
Some people call threads lightweight processes, and processes heavyweight
processes. Some people call processes tasks.
Many application programs, such as Microsoft Word, are starting to make use of
multiple threads. For example, there is a thread that processes the input, and a
thread for doing repagination in the background. A compiler could have multiple
threads, one for lexical analysis, one for parsing, one for analysing the abstract
syntax tree. These can all execute in parallel, although the parser cannot execute
ahead of the lexical analyser, and the abstract syntax tree analyser can only
process the portion of the abstract syntax tree already generated by the parser. The
code for performing graphics can easily be sped up by having multiple threads,
each painting a portion of the screen. File and network servers have to deal with
multiple external requests, many of which block before the reply is given. An
elegant way of programming servers is to have a thread for each request.
Multi-threaded processes are becoming very important, because computers with
multiple processors are becoming commonplace, as are distributed systems, and servers.
It is important that you learn how to program in this manner. Multi-threaded
programming, particularly dealing with synchronisation issues, is not trivial, and a good
conceptual understanding of synchronisation is essential. Synchronisation is dealt with
fully in the stage 3 operating systems paper.
Objectives
An operating system can be thought of as having three objectives:
Convenience: An operating system makes a computer more convenient to use.
Efficiency: An operating system allows the computer system resources to be used in an
efficient manner.
Ability to evolve: An operating system should be constructed in such a way as to permit
the effective development, testing and introduction of new system functions without
interfering with current services provided.
What is an Operating System?
An operating system (OS) is a program that controls the execution of an application
program and acts as an interface between the user and computer hardware. The purpose
of an OS is to provide an environment in which a user can execute programs in a
convenient and efficient manner.
The operating system must provide certain services to programs and to the users of those
programs in order to make the programming task easier; these services differ from
one OS to another.
Functions of an Operating System
Modern operating systems generally have the following three major goals. Operating
systems generally accomplish these goals by running processes in a low-privilege state and
providing service calls that invoke the operating system kernel in a high-privilege state.
To hide details of hardware
An abstraction is software that hides lower-level details and provides a set of higher-level
functions. An operating system transforms the physical world of devices, instructions,
memory, and time into a virtual world that is the result of abstractions built by the
operating system. There are several reasons for abstraction.
First, the code needed to control peripheral devices is not standardized. Operating
systems provide subroutines called device drivers that perform operations on behalf of
programs, for example input/output operations.
Second, the operating system introduces new functions as it abstracts the hardware. For
instance, the operating system introduces the file abstraction so that programs do not have to
deal with disks.
Third, the operating system transforms the computer hardware into multiple virtual
computers, each belonging to a different program. Each program that is running is called
a process. Each process views the hardware through the lens of abstraction.
Fourth, the operating system can enforce security through abstraction.
Resources Management
An operating system, as a resource manager, controls how processes (the active agents)
may access resources (passive entities). One can view operating systems from two
points of view: resource manager and extended machine. From the resource-manager
point of view, operating systems manage the different parts of the system efficiently, and
from the extended-machine point of view, operating systems provide a virtual machine to
users that is more convenient to use. Structurally, operating systems can be designed as
a monolithic system, a hierarchy of layers, a virtual machine system, a micro-kernel, or
using the client-server model. The basic concepts of operating systems are processes,
memory management, I/O management, the file system, and security.
Provide an effective user interface
The user interacts with the operating system through the user interface and is usually
interested in the look and feel of the operating system. The most important components
of the user interface are the command interpreter, the file system, on-line help, and
application integration. The recent trend has been toward increasingly integrated
graphical user interfaces that encompass the activities of multiple processes on networks
of computers.
Evolution of Operating System
Operating system and computer architecture have had a great deal of influence on each
other. To facilitate the use of the hardware, OSs were developed. As operating systems
were designed and used, it became obvious that changes in the design of the hardware
could simplify them.
Early Systems
In the earliest days of electronic digital computing, everything was done on the bare
hardware. Very few computers existed and those that did exist were experimental in
nature. The researchers who were making the first computers were also the programmers
and the users. They worked directly on the “bare hardware”. There was no operating
system. The experimenters wrote their programs in assembly language and a running
program had complete control of the entire computer. Debugging consisted of a
combination of fixing both the software and hardware, rewriting the object code and
changing the actual computer itself.
The lack of any operating system meant that only one person could use a computer at a
time. Even in the research lab, there were many researchers competing for limited
computing time. The first solution was a reservation system, with researchers signing up
for specific time slots.
The high cost of early computers meant that it was essential that the rare computers be
used as efficiently as possible. The reservation system was not particularly efficient. If a
researcher finished work early, the computer sat idle until the next time slot. If the
researcher’s time ran out, the researcher might have to pack up his or her work in an
incomplete state at an awkward moment to make room for the next researcher. Even
when things were going well, a lot of the time the computer actually sat idle while the
researcher studied the results (or studied memory of a crashed program to figure out what
went wrong).
The solution to this problem was to have programmers prepare their work off-line on
some input medium (often on punched cards, paper tape, or magnetic tape) and then hand
the work to a computer operator. The computer operator would load up jobs in the order
received (with priority overrides based on politics and other factors). Each job still ran
one at a time with complete control of the computer, but as soon as a job finished, the
operator would transfer the results to some output medium (punched tape, paper tape,
magnetic tape, or printed paper) and deliver the results to the appropriate programmer. If
the program ran to completion, the result would be some end data. If the program
crashed, memory would be transferred to some output medium for the programmer to
study (because some of the early business computing systems used magnetic core
memory, these became known as “core dumps”).
Soon after the first successes with digital computer experiments, computers moved out of
the lab and into practical use. The first practical application of these experimental digital
computers was the generation of artillery tables for the British and American armies.
Much of the early research in computers was paid for by the British and American
militaries. Business and scientific applications followed.
As computer use increased, programmers noticed that they were duplicating the same
efforts.
Every programmer was writing his or her own routines for I/O, such as reading input
from a magnetic tape or writing output to a line printer. It made sense to write a common
device driver for each input or output device and then have every programmer share the
same device drivers rather than each programmer writing his or her own. Some
programmers resisted the use of common device drivers in the belief that they could write
“more efficient”, faster, or “better” device drivers of their own.
Additionally each programmer was writing his or her own routines for fairly common
and repeated functionality, such as mathematics or string functions. Again, it made sense
to share the work instead of everyone repeatedly “reinventing the wheel”. These shared
functions would be organized into libraries and could be inserted into programs as
needed. In the spirit of cooperation among early researchers, these library functions were
published and distributed for free, an early example of the power of the open source
approach to software development.
Simple Batch Systems
When punched cards were used for user jobs, processing of a job involved physical
actions by the system operator, e.g., loading a deck of cards into the card reader, pressing
switches on the computer’s console to initiate a job, etc. These actions wasted a lot of
central processing unit (CPU) time.
Figure 1.1: Simple Batch System (the batch monitor / operating system resident in one
part of memory; the remainder is the user program area)
To speed up processing, jobs with similar needs were batched together and were run as a
group. Batch processing (BP) was implemented by locating a component of the BP
system, called the batch monitor or supervisor, permanently in one part of computer’s
memory. The remaining memory was used to process a user job – the current job in the
batch as shown in the figure 1.1 above.
The delay between job submission and completion was considerable in batch-processed
systems, as a number of programs were put in a batch and the entire batch had to be
processed before the results were printed. Further, card reading and printing were slow, as
they used slower mechanical units compared with the CPU, which was electronic. The speed
mismatch was of the order of 1000. To alleviate this problem, programs were spooled.
Spool is an acronym for simultaneous peripheral operation on-line. In essence, the idea
was to use a cheaper processor, known as a peripheral processing unit (PPU), to read
programs and data from cards and store them on a disk. The faster CPU read programs/data
from the disk, processed them, and wrote the results back on the disk. The cheaper
processor then read the results from the disk and printed them.
Multi Programmed Batch Systems
Even though disks are faster than card readers and printers, they are still two orders of
magnitude slower than the CPU. It is thus useful to have several programs ready to run
waiting in main memory. When one program needs input/output (I/O) from
disk, it is suspended and another program, whose data is already in main memory (as
shown in figure 1.2 below), is taken up for execution. This is called
multiprogramming.
Figure 1.2: Multi Programmed Batch Systems (memory holds the operating system
together with several ready programs, Program 1 through Program 4)
Multiprogramming (MP) increases CPU utilization by organizing jobs such that the CPU
always has a job to execute. Multiprogramming is the first instance where the operating
system must make decisions for the user.
The MP arrangement ensures concurrent operation of the CPU and the I/O subsystem. It
ensures that the CPU is allocated to a program only when it is not performing an I/O
operation.
Time Sharing Systems
Multiprogramming features were superimposed on BP to ensure good utilization of CPU
but from the point of view of a user the service was poor as the response time, i.e., the
time elapsed between submitting a job and getting the results was unacceptably high.
Development of interactive terminals changed the scenario. Computation became an on-
line activity. A user could provide inputs to a computation from a terminal and could also
examine the output of the computation on the same terminal. Hence, the response time
needed to be drastically reduced. This was achieved by storing the programs of several users
in memory and providing each user a slice of time on the CPU to process his/her program.
Distributed Systems
A recent trend in computer systems is to distribute computation among several processors.
In loosely coupled systems, the processors do not share memory or a clock. Instead,
each processor has its own local memory. The processors communicate with one another
using a communication network.
The processors in a distributed system may vary in size and function, and are referred to by a
number of different names, such as sites, nodes, computers and so on, depending on the
context. The major reasons for building distributed systems are:
Resource sharing: If a number of different sites are connected to one another, then a
user at one site may be able to use the resources available at the other.
Computation speed up: If a particular computation can be partitioned into a number of
sub computations that can run concurrently, then a distributed system may allow a user to
distribute computation among the various sites to run them concurrently.
Reliability: If one site fails in a distributed system, the remaining sites can potentially
continue operations.
Communication: There are many instances in which programs need to exchange data
with one another. A distributed database system is an example of this.
Real-time Operating System
The advent of timesharing provided good response times to computer users. However,
timesharing could not satisfy the requirements of some applications. Real-time (RT)
operating systems were developed to meet the response requirements of such
applications.
There are two flavors of real-time systems. A hard real-time system guarantees that
critical tasks complete at a specified time. A less restrictive type of real-time system is the
soft real-time system, where a critical real-time task gets priority over other tasks, and
retains that priority until it completes. The areas in which this type is useful include
multimedia, virtual reality, and advanced scientific projects such as undersea exploration
and planetary rovers. Because of the expanded uses for soft real-time functionality, it is
finding its way into most current operating systems, including major versions of Unix and
Windows NT.
A real-time operating system is one which helps to fulfil the worst-case response-time
requirements of an application. An RT OS provides the following facilities for this
purpose:
1. Multitasking within an application.
2. Ability to define the priorities of tasks.
3. Priority driven or deadline oriented scheduling.
4. Programmer defined interrupts.
A task is a sub-computation in an application program, which can be executed
concurrently with other sub-computations in the program, except at specific places in its
execution called synchronization points. Multi-tasking, which permits the existence of
many tasks within the application program, provides the possibility of overlapping the
CPU and I/O activities of the application with one another. This helps in reducing its
elapsed time. The ability to specify priorities for the tasks provides additional controls to
a designer while structuring an application to meet its response-time requirements.
Real time operating systems (RTOS) are specifically designed to respond to events that
happen in real time. This can include computer systems that run factory floors, computer
systems for emergency room or intensive care unit equipment (or even the entire ICU),
computer systems for air traffic control, or embedded systems. RTOSs are grouped
according to the response time that is acceptable (seconds, milliseconds, microseconds)
and according to whether or not they involve systems where failure can result in loss of
life. Examples of real-time operating systems include QNX, Jaluna-1, ChorusOS,
LynxOS, Windows CE .NET, and VxWorks AE.
Self assessment questions
1. What do the terms program, process, and thread mean?
2. What is the purpose of a compiler, assembler and linker?
3. What is the structure of a code file? What is the purpose of the symbol table in a
code file?
4. Why are shared libraries essential on modern computers?
Operating System Components
Even though not all systems have the same structure, many modern operating systems
share the goal of supporting the following types of system components.
Process Management
The operating system manages many kinds of activities ranging from user programs to
system programs like printer spooler, name servers, file server etc. Each of these
activities is encapsulated in a process. A process includes the complete execution context
(code, data, PC, registers, OS resources in use etc.).
It is important to note that a process is not a program. A process is only one instance of a
program in execution. Many processes can be running the same program. The
five major activities of an operating system in regard to process management are:
1. Creation and deletion of user and system processes.
2. Suspension and resumption of processes.
3. A mechanism for process synchronization.
4. A mechanism for process communication.
5. A mechanism for deadlock handling.
Main-Memory Management
Primary memory, or main memory, is a large array of words or bytes. Each word or byte
has its own address. Main memory provides storage that can be accessed directly by the
CPU. That is to say, for a program to be executed, it must be in main memory.
The major activities of an operating system in regard to memory management are:
1. Keep track of which parts of memory are currently being used and by whom.
2. Decide which processes are loaded into memory when memory space becomes
available.
3. Allocate and de-allocate memory space as needed.
File Management
A file is a collection of related information defined by its creator. Computers can store
files on disk (secondary storage), which provides long-term storage. Some examples
of storage media are magnetic tape, magnetic disk and optical disk. Each of these media
has its own properties, like speed, capacity, data transfer rate and access method.
A file system is normally organized into directories to ease use. These directories may
contain files and other directories.
The five major activities of an operating system in regard to file management are:
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.
I/O System Management
The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the
device driver knows the peculiarities of the specific device to which it is assigned.
Secondary-Storage Management
Generally speaking, systems have several levels of storage, including primary storage,
secondary storage and cache storage. Instructions and data must be placed in primary
storage or cache to be referenced by a running program. Because main memory is too
small to accommodate all data and programs, and its data are lost when power is lost, the
computer system must provide secondary storage to back up main memory. Secondary
storage consists of tapes, disks, and other media designed to hold information that will
eventually be accessed in primary storage. Storage at each level (primary, secondary, cache)
is ordinarily divided into bytes, or words consisting of a fixed number of bytes. Each location in
storage has an address; the set of all addresses available to a program is called an address
space.
The three major activities of an operating system in regard to secondary storage
management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for disk access.
Networking
A distributed system is a collection of processors that do not share memory, peripheral
devices, or a clock. The processors communicate with one another through
communication lines called a network. The communication-network design must consider
routing and connection strategies, and the problems of contention and security.
Protection System
If a computer system has multiple users and allows the concurrent execution of multiple
processes, then various processes must be protected from one another’s activities.
Protection refers to a mechanism for controlling the access of programs, processes, or users
to the resources defined by a computer system.
Command Interpreter System
A command interpreter is an interface of the operating system with the user. The user
gives commands which are executed by the operating system (usually by turning them into
system calls). The main function of a command interpreter is to get and execute the next
user-specified command. The command interpreter is usually not part of the kernel, since
multiple command interpreters (shell, in UNIX terminology) may be supported by an
operating system, and they do not really need to run in kernel mode. There are two main
advantages of separating the command interpreter from the kernel.
1. If we want to change the way the command interpreter looks, i.e., change the
interface of the command interpreter, we can do so if the command interpreter is
separate from the kernel. We cannot change the code of the kernel, so we could
not modify the interface otherwise.
2. If the command interpreter is a part of the kernel, it is possible for a malicious
process to gain access to certain parts of the kernel that it should not have. To
avoid this scenario it is advantageous to have the command interpreter separate
from kernel.
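The get-and-execute loop described above can be sketched entirely in user space, which is exactly why the interpreter need not live in the kernel. In this toy Python version, the echo built-in and the use of the subprocess module (which wraps the operating system's process-creation services) are illustrative choices, not how a production shell is written:

```python
import shlex
import subprocess

def interpret(command_line):
    """Get one user-specified command and execute it."""
    args = shlex.split(command_line)
    if not args:
        return ""
    if args[0] == "echo":
        # A built-in: handled by the interpreter itself.
        return " ".join(args[1:])
    # Anything else is turned into a request to the operating system.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout.strip()
```

A real shell would loop, reading one line at a time and dispatching each through a function like this.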
Self Assessment Questions
1. Discuss the various components of an OS.
2. Explain memory management and file management in brief.
3. Write a note on:
1. Secondary-Storage Management
2. Command Interpreter System
Operating System Services
Following are the five services provided by operating systems for the convenience of the
users.
Program Execution
The purpose of a computer system is to allow the user to execute programs. So the
operating system provides an environment where the user can conveniently run programs.
The user does not have to worry about memory allocation or multitasking; these
things are taken care of by the operating system.
Running a program involves allocating and de-allocating memory, and CPU scheduling
in the case of multiple processes. These functions cannot be given to user-level
programs, so user-level programs cannot help the user to run programs independently
without help from the operating system.
I/O Operations
Each program requires input and produces output. This involves the use of I/O. The
operating system hides from the user the details of the underlying hardware for the I/O.
All the user sees is that the I/O has been performed, without any of the details. So the
operating system, by providing I/O, makes it convenient for the users to run programs.
For efficiency and protection, users cannot control I/O directly, so this service cannot be
provided by user-level programs.
File System Manipulation
The output of a program may need to be written into new files or input taken from some
files. The operating system provides this service. The user does not have to worry about
secondary storage management. User gives a command for reading or writing to a file
and sees his/her task accomplished. Thus operating system makes it easier for user
programs to accomplish their task.
This service involves secondary storage management. The speed of I/O that depends on
secondary storage management is critical to the speed of many programs, and hence it is
best left to the operating system to manage rather than giving individual users control of
it. It is not difficult for user-level programs to provide these services, but for the
above-mentioned reasons it is best if this service is left with the operating system.
Communications
There are instances where processes need to communicate with each other to exchange
information. It may be between processes running on the same computer or running on
the different computers. By providing this service the operating system relieves the user
from the worry of passing messages between processes. In cases where messages need
to be passed to processes on other computers through a network, this can be done by
user programs. The user program may be customized to the specifications of the
hardware through which the message transits, and provides the service interface to the
operating system.
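On a POSIX system, the simplest form of this service is a pipe between two processes on the same computer. The sketch below (the message text is arbitrary) relies on the kernel to carry the bytes; the two processes never share memory:

```python
import os

# Two processes exchange a message through a kernel-provided pipe
# (POSIX-only sketch).
read_fd, write_fd = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: close the unused end, send, and exit.
    os.close(read_fd)
    os.write(write_fd, b"hello from child")
    os._exit(0)
else:
    # Parent: close the unused end, receive, and reap the child.
    os.close(write_fd)
    message = os.read(read_fd, 1024)
    os.close(read_fd)
    os.wait()
```

The pipe and the process-creation call are both services the operating system provides; the user program only names what it wants done.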
Error Detection
An error in one part of the system may cause malfunctioning of the complete system. To
avoid such a situation the operating system constantly monitors the system for detecting
the errors. This relieves the user from the worry of errors propagating to various parts of
the system and causing malfunctioning.
This service cannot be allowed to be handled by user programs because it involves
monitoring, and in some cases altering, areas of memory, de-allocating memory of a
faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop.
These tasks are too critical to be handed over to the user programs. A user program if
given these privileges can interfere with the correct (normal) operation of the operating
systems.
Self Assessment Questions
1. Explain the five services provided by the operating system.
Operating Systems for Different Computers
Operating systems can be grouped according to functionality: operating systems for
Supercomputers, Computer Clusters, Mainframes, Servers, Workstations, Desktops,
Handheld Devices, Real Time Systems, or Embedded Systems.
OS for Supercomputers:
Supercomputers are the fastest computers, very expensive and are employed for
specialized applications that require immense amounts of mathematical calculations, for
example, weather forecasting, animated graphics, fluid dynamic calculations, nuclear
energy research, and petroleum exploration. Out of many operating systems used for
supercomputing UNIX and Linux are the most dominant ones.
Computer Clusters Operating Systems:
A computer cluster is a group of computers that work together closely so that in many
respects they can be viewed as though they are a single computer. The components of a
cluster are commonly connected to each other through fast local area networks. Besides
many open-source operating systems and two versions of Windows 2003 Server, Linux
is popularly used for computer clusters.
Mainframe Operating Systems:
Mainframes used to be the primary form of computer. Mainframes are large centralized
computers and at one time they provided the bulk of business computing through time
sharing. Mainframes are still useful for some large scale tasks, such as centralized billing
systems, inventory systems, database operations, etc.
Minicomputers were smaller, less expensive versions of mainframes for businesses that
couldn’t afford true mainframes. The chief difference between a supercomputer and a
mainframe is that a supercomputer channels all its power into executing a few programs
as fast as possible, whereas a mainframe uses its power to execute many programs
concurrently. Besides various versions of operating systems by IBM for its early
System/360, to newest Z series operating system z/OS, Unix and Linux are also used as
mainframe operating systems.
Servers Operating Systems:
Servers are computers or groups of computers that provide services to other computers
connected via a network. Based on the requirements, there are various versions of server
operating systems from different vendors, starting with Microsoft’s Servers from
Windows NT to Windows 2003, OS/2 servers, UNIX servers, Mac OS servers, and
various flavors of Linux.
Workstation Operating Systems:
Workstations are more powerful versions of personal computers. Like desktop
computers, workstations are often used by only one person, but they run a more
powerful version of a desktop operating system. Most of the time, workstations are used
as clients in a network environment. The popular workstation operating systems are
Windows NT Workstation, Windows 2000 Professional, OS/2 Clients, Mac OS, UNIX, Linux, etc.
Desktop Operating Systems:
A personal computer (PC) is a microcomputer whose price, size, and capabilities make it
useful for individuals; such machines are also known as desktop computers or home computers.
Desktop operating systems are used for personal computers, for example DOS, Windows
9x, Windows XP, Macintosh OS, Linux, etc.
Embedded Operating Systems:
Embedded systems are combinations of processors and special software that are inside of
another device, such as the electronic ignition system on cars. Examples of embedded
operating systems are Embedded Linux, Windows CE, Windows XP Embedded, Free
DOS, Free RTOS, etc.
Operating Systems for Handheld Computers:
Handheld operating systems are much smaller and less capable than desktop operating
systems, so that they can fit into the limited memory of handheld devices. These
operating systems include Palm OS, Windows CE, EPOC, and many Linux versions
such as Qt Palmtop, Pocket Linux, etc.
Summary
An operating system (OS) is a program that controls the execution of an application
program and acts as an interface between the user and computer hardware. The objectives
of an operating system are convenience, efficiency, and the ability to evolve. Besides this,
the operating system performs functions such as hiding details of the hardware, resource
management, and providing an effective user interface.
The process management component of the operating system is responsible for the
creation, termination, and state transitions of processes. The memory management unit
is mainly responsible for allocation and de-allocation of memory to processes, and for
keeping track of memory usage by different processes. The operating system services
are program execution, I/O operations, file system manipulation, communication and
error detection.
Terminal Questions
1. What is an operating system?
2. What are the objectives of an operating system?
3. Describe in brief, the function of an operating system.
4. Explain the evolution of operating system in brief.
5. Write a note on Batch OS. Discuss how it differs from Multi-Programmed Batch
Systems.
6. What is the difference between multi-programming and time-sharing operating
systems?
7. What are the typical features that an operating system provides?
8. Explain the functions of operating system as file manager.
9. What are different services provided by an operating system?
10. Write a note on:
1. Mainframe Operating Systems
2. Embedded Operating Systems
3. Servers Operating Systems
4. Desktop Operating Systems
Unit 2: Operating System Architecture
This unit deals with the simple structure, the extended machine, and layered approaches.
It covers the different methodologies for OS design (models). It introduces virtual
machines, virtual environments and machine aggregation, and also describes the
implementation techniques.
Introduction
A system as large and complex as a modern operating system must be engineered
carefully if it is to function properly and be modified easily. A common approach is to
partition the task into small components rather than have one monolithic system. Each of
these modules should be a well-defined portion of the system, with carefully defined
inputs, outputs, and functions. In this unit, we discuss how various components of an
operating system are interconnected and melded into a kernel.
Objective:
At the end of this unit, readers would be able to understand:
• What is Kernel? Monolithic Kernel Architecture
• Layered Architecture
• Microkernel Architecture
• Operating System Components
• Operating System Services
OS as an Extended Machine
We can think of an operating system as an Extended Machine standing between our
programs and the bare hardware.
As shown in figure 2.1 above, the operating system interacts with the hardware, hiding it
from the application program and the user. Thus it acts as an interface between user
programs and hardware.
Self Assessment Questions
1. What is the role of an Operating System?
Simple Structure
Many commercial systems do not have well-defined structures. Frequently, such
operating systems started as small, simple, and limited systems and then grew beyond
their original scope. MS-DOS is an example of such a system. It was originally designed
and implemented by a few people who had no idea that it would become so popular. It
was written to provide the most functionality in the least space, so it was not divided into
modules carefully. In MS-DOS, the interfaces and levels of functionality are not well
separated. For instance, application programs are able to access the basic I/O routines to
write directly to the display and disk drives. Such freedom leaves MS-DOS vulnerable to
errant (or malicious) programs, causing entire system crashes when user programs fail.
Of course, MS-DOS was also limited by the hardware of its era. Because the Intel 8088
for which it was written provides no dual mode and no hardware protection, the designers
of MS-DOS had no choice but to leave the base hardware accessible.
Another example of limited structuring is the original UNIX operating system. UNIX is
another system that initially was limited by hardware functionality. It consists of two
separable parts:
• the kernel and
• the system programs
The kernel is further separated into a series of interfaces and device drivers, which have
been added and expanded over the years as UNIX has evolved. We can view the
traditional UNIX operating system as being layered. Everything below the system call
interface and above the physical hardware is the kernel. The kernel provides the file
system, CPU scheduling, memory management, and other operating-system functions
through system calls. Taken in sum, that is an enormous amount of functionality to be
combined into one level. This monolithic structure was difficult to implement and
maintain.
Self Assessment Questions
1. ”In MS-DOS, the interfaces and levels of functionality are not well separated”.
Comment on this.
2. What are the components of a Unix Operating System?
Layered Approach
With proper hardware support, operating systems can be broken into pieces that are
smaller and more appropriate than those allowed by the original MS-DOS or UNIX
systems. The operating system can then retain much greater control over the computer
and over the applications that make use of that computer. Implementers have more
freedom in changing the inner workings of the system and in creating modular operating
systems. Under the top-down approach, the overall functionality and features are
determined and the separated into components. Information hiding is also important,
because it leaves programmers free to implement the low-level routines as they see fit,
provided that the external interface of the routine stays unchanged and that the routine
itself performs the advertised task.
A system can be made modular in many ways. One method is the layered approach, in
which the operating system is broken up into a number of layers (levels). The bottom
layer (layer 0) is the hardware; the highest (layer N) is the user interface.
Users
File Systems
Inter-process Communication
I/O and Device Management
Virtual Memory
Primitive Process Management
Hardware
Fig. 2.2: Layered Architecture
An operating-system layer is an implementation of an abstract object made up of data and
the operations that can manipulate those data. A typical operating-system layer (say,
layer M) consists of data structures and a set of routines that can be invoked by
higher-level layers. Layer M, in turn, can invoke operations on lower-level layers.
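A toy sketch of this discipline: each layer below is a class whose routines call only the layer beneath it, mirroring Fig. 2.2 (the three layers shown, the block-naming scheme, and the name-to-sector mapping are all invented for illustration):

```python
class Hardware:                     # layer 0
    def read_block(self, n):
        return f"raw-block-{n}"

class DeviceLayer:                  # layer 1: invokes only Hardware
    def __init__(self, hw):
        self._hw = hw
    def read_sector(self, n):
        return self._hw.read_block(n)

class FileSystemLayer:              # layer 2: invokes only DeviceLayer
    def __init__(self, dev):
        self._dev = dev
    def read_file(self, name):
        sector = hash(name) % 8     # a made-up name-to-sector mapping
        return self._dev.read_sector(sector)

fs = FileSystemLayer(DeviceLayer(Hardware()))
data = fs.read_file("notes.txt")    # the call descends layer by layer
```

Note that FileSystemLayer never touches Hardware directly; it needs to know only what DeviceLayer's operations do, not how they are implemented.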
The main advantage of the layered approach is simplicity of construction and debugging.
The layers are selected so that each uses functions (operations) and services of only
lower-level layers. This approach simplifies debugging and system verification. The first
layer can be debugged without any concern for the rest of the system, because, by
definition, it uses only the basic hardware (which is assumed correct) to implement its
functions. Once the first layer is debugged, its correct functioning can be assumed while
the second layer is debugged, and so on. If an error is found during debugging of a
particular layer, the error must be on that layer, because the layers below it are already
debugged. Thus, the design and implementation of the system is simplified.
Each layer is implemented with only those operations provided by lower-level layers. A
layer does not need to know how these operations are implemented; it needs to know
only what these operations do. Hence, each layer hides the existence of certain data
structures, operations, and hardware from higher-level layers. The major difficulty with
the layered approach involves appropriately defining the various layers. Because a layer
can use only lower-level layers, careful planning is necessary. For example, the device
driver for the backing store (disk space used by virtual-memory algorithms) must be at a
lower level than the memory-management routines, because memory management
requires the ability to use the backing store.
Other requirements may not be so obvious. The backing-store driver would normally be
above the CPU scheduler, because the driver may need to wait for I/O and the CPU can
be rescheduled during this time. However, on a larger system, the CPU scheduler may
have more information about all the active processes than can fit in memory. Therefore,
this information may need to be swapped in and out of memory, requiring the backing-
store driver routine to be below the CPU scheduler.
A final problem with layered implementations is that they tend to be less efficient than
other types. For instance, when a user program executes an I/O operation, it executes a
system call that is trapped to the I/O layer, which calls the memory-management layer,
which in turn calls the CPU-scheduling layer, which is then passed to the hardware. At
each layer, the parameters may be modified; data may need to be passed, and so on. Each
layer adds overhead to the system call; the net result is a system call that takes longer
than does one on a non-layered system. These limitations have caused a small backlash
against layering in recent years. Fewer layers with more functionality are being designed,
providing most of the advantages of modularized code while avoiding the difficult
problems of layer definition and interaction.
Self Assessment Questions
1. What is the layered Architecture of UNIX?
2. What are the advantages of layered Architecture?
Micro-kernels
We have already seen that as UNIX expanded, the kernel became large and difficult to
manage. In the mid-1980s, researchers at Carnegie Mellon University developed an
operating system called Mach that modularized the kernel using the microkernel
approach. This method structures the operating system by removing all nonessential
components from the kernel and implementing them as system and user-level programs.
The result is a smaller kernel. There is little consensus regarding which services should
remain in the kernel and which should be implemented in user space. Typically, however,
micro-kernels provide minimal process and memory management, in addition to a
communication facility.
Device
Drivers
File Server
Client Process
….
Virtual Memory
Microkernel
Hardware
Fig. 2.3: Microkernel Architecture
The main function of the microkernel is to provide a communication facility between the
client program and the various services that are also running in user space.
Communication is provided by message passing. The client program and the service
never interact directly; rather, they communicate indirectly by exchanging messages
with the microkernel.
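That indirection can be sketched as follows; the mailbox dictionary, the service names, and the message format are invented for illustration, standing in for the kernel's real message-passing primitives:

```python
import queue

class Microkernel:
    """Only job here: pass messages between user-space processes."""

    def __init__(self):
        self._mailboxes = {}

    def register(self, name):
        self._mailboxes[name] = queue.Queue()

    def send(self, dest, message):
        self._mailboxes[dest].put(message)

    def receive(self, name):
        return self._mailboxes[name].get()

kernel = Microkernel()
kernel.register("file_server")
kernel.register("client")

# The client asks the file server for a file, via the kernel only.
kernel.send("file_server", ("read", "notes.txt", "client"))

# The file server handles the request and replies, again via the kernel.
op, fname, reply_to = kernel.receive("file_server")
kernel.send(reply_to, f"contents of {fname}")

reply = kernel.receive("client")
```

At no point does the client hold a reference to the file server; every interaction goes through the kernel's send and receive operations.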
One benefit of the microkernel approach is ease of extending the operating system. All
new services are added to user space and consequently do not require modification of the
kernel. When the kernel does have to be modified, the changes tend to be fewer, because
the microkernel is a smaller kernel. The resulting operating system is easier to port from
one hardware design to another. The microkernel also provides more security and
reliability, since most services run as user (rather than kernel) processes; if a service
fails, the rest of the operating system remains untouched.
Several contemporary operating systems have used the microkernel approach. Tru64
UNIX (formerly Digital UNIX) provides a UNIX interface to the user, but it is
implemented with a Mach kernel. The Mach kernel maps UNIX system calls into
messages to the appropriate user-level services.
The following figure shows the UNIX operating system architecture. At the center is the
hardware, covered by the kernel. Above that are the UNIX utilities and the command
interface, such as the shell (sh), etc.
Self Assessment Questions
1. What other facilities Micro-kernel provides in addition to Communication
facility?
2. What are the benefits of Micro-kernel?
UNIX kernel Components
The UNIX kernel has components as depicted in figure 2.5 below. The figure is
divided into three modes: user mode, kernel mode, and hardware. The user mode
contains user programs which can access the services of the kernel components using
system call interface.
The kernel mode has four major components: system calls, file subsystem, process
control subsystem, and hardware control. The system calls are interface between user
programs and file and process control subsystems. The file subsystem is responsible for
file and I/O management through device drivers.
The process control subsystem contains scheduler, Inter-process communication and
memory management. Finally the hardware control is the interface between these two
subsystems and hardware.
Fig. 2.5: Unix kernel components
Another example is QNX. QNX is a real-time operating system that is also based on the
microkernel design. The QNX microkernel provides services for message passing and
process scheduling. It also handles low-level network communication and hardware
interrupts. All other services in QNX are provided by standard processes that run outside
the kernel in user mode.
Unfortunately, microkernels can suffer from performance decreases due to increased
system function overhead. Consider the history of Windows NT. The first release had a
layered microkernel organization. However, this version delivered low performance
compared with that of Windows 95. Windows NT 4.0 partially redressed the performance
problem by moving layers from user space to kernel space and integrating them more
closely. By the time Windows XP was designed, its architecture was more monolithic
than microkernel.
Self Assessment Questions
1. What are the components of UNIX Kernel?
2. Under what circumstances a Micro-kernel may suffer from performance
decrease?
Modules
Perhaps the best current methodology for operating-system design involves using object-
oriented programming techniques to create a modular kernel. Here, the kernel has a set of
core components and dynamically links in additional services either during boot time or
during run time. Such a strategy uses dynamically loadable modules and is common in
modern implementations of UNIX, such as Solaris, Linux and Mac OS X. For example,
the Solaris operating system structure is organized around a core kernel with seven types
of loadable kernel modules:
1. Scheduling classes
2. File systems
3. Loadable system calls
4. Executable formats
5. STREAMS modules
6. Miscellaneous
7. Device and bus drivers
Such a design allows the kernel to provide core services yet also allows certain
features to be implemented dynamically. For example, device and bus drivers for
specific hardware can be added to the kernel, and support for different file
systems can be added as loadable modules. The overall result resembles a layered
system in that each kernel section has defined, protected interfaces; but it is more
flexible than a layered system in that any module can call any other module.
Furthermore, the approach is like the microkernel approach in that the primary
module has only core functions and knowledge of how to load and communicate
with other modules; but it is more efficient, because modules do not need to
invoke message passing in order to communicate.
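A hedged sketch of the idea: a core "kernel" object knows only how to load and look up services linked in at run time. The module name fat32 and its mount entry point are invented for illustration; a real system would perform an actual dynamic-link step rather than registering a Python object:

```python
import types

class Kernel:
    """Core kernel: knows only how to load and look up modules."""

    def __init__(self):
        self.services = {}

    def load_module(self, name, module):
        # Stands in for the dynamic-link step a real kernel performs.
        self.services[name] = module

# Build a "file system" module at run time and load it.
fs_module = types.ModuleType("fat32")
fs_module.mount = lambda device: f"{device} mounted as fat32"

kernel = Kernel()
kernel.load_module("fat32", fs_module)
result = kernel.services["fat32"].mount("/dev/sda1")
```

Because the module is called directly once loaded, there is no message-passing overhead, which is the efficiency advantage the text notes over the microkernel approach.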
Self Assessment Questions
1. Which strategy uses dynamically loadable modules and is common in
modern implementations of UNIX?
2. What are different loadable modules based on which the Solaris operating
system structure is organized around a core kernel?
Introduction to Virtual Machine
The layered approach of operating systems is taken to its logical conclusion in the
concept of virtual machine. The fundamental idea behind a virtual machine is to abstract
the hardware of a single computer (the CPU, Memory, Disk drives, Network Interface
Cards, and so forth) into several different execution environments and thereby creating
the illusion that each separate execution environment is running its own private
computer. By using CPU Scheduling and Virtual Memory techniques, an operating
system can create the illusion that a process has its own processor with its own (virtual)
memory. Normally a process has additional features, such as system calls and a file
system, which are not provided by the hardware. The Virtual machine approach does not
provide any such additional functionality but rather an interface that is identical to the
underlying bare hardware. Each process is provided with a (virtual) copy of the
underlying computer.
Hardware Virtual machine
The original meaning of virtual machine, sometimes called a hardware virtual
machine, is that of a number of discrete identical execution environments on a single
computer, each of which runs an operating system (OS). This can allow applications
written for one OS to be executed on a machine which runs a different OS, or provide
execution “sandboxes” which provide a greater level of isolation between processes than
is achieved when running multiple processes on the same instance of an OS. One use is to
provide multiple users the illusion of having an entire computer, one that is their
“private” machine, isolated from other users, all on a single physical machine. Another
advantage is that booting and restarting a virtual machine can be much faster than with a
physical machine, since it may be possible to skip tasks such as hardware initialization.
Such software is now often referred to with the terms virtualization and virtual servers.
The host software which provides this capability is often referred to as a virtual machine
monitor or hypervisor.
Software virtualization can be done in three major ways:
• Emulation, full system simulation, or "full virtualization with dynamic recompilation":
the virtual machine simulates the complete hardware, allowing an unmodified OS for a
completely different CPU to be run.
• Paravirtualization: the virtual machine does not simulate hardware but instead offers a
special API that requires OS modifications. An example of this is XenSource's
XenEnterprise (www.xensource.com).
• Native virtualization and "full virtualization": the virtual machine only partially
simulates enough hardware to allow an unmodified OS to be run in isolation, but the
guest OS must be designed for the same type of CPU. The term native virtualization is
also sometimes used to designate that hardware assistance through Virtualization
Technology is used.
Application virtual machine
Another meaning of virtual machine is a piece of computer software that isolates the
application being used by the user from the computer. Because versions of the virtual
machine are written for various computer platforms, any application written for the
virtual machine can be operated on any of the platforms, instead of having to produce
separate versions of the application for each computer and operating system. The
application is run on the computer using an interpreter or Just In Time compilation. One
of the best known examples of an application virtual machine is Sun Microsystems' Java
Virtual Machine.
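The interpreter at the heart of such a machine can be sketched as a small stack-based bytecode loop; the opcodes below are invented for illustration and are far simpler than real JVM bytecode:

```python
def run(bytecode):
    """Interpret a list of (opcode, argument) pairs on a value stack."""
    stack = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()

# The same "program" runs wherever the interpreter has been ported.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
```

Here run(program) computes (2 + 3) * 4; only the interpreter itself would need to be rewritten for a new platform, not the program.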
Self Assessment Questions
1. What do you mean by a Virtual Machine?
2. Differentiate Hardware Virtual Machines and Software Virtual Machines.
Virtual Environment
A virtual environment (otherwise referred to as Virtual private server) is another kind of
a virtual machine. In fact, it is a virtualized environment for running user-level programs
(i.e. not the operating system kernel and drivers, but applications). Virtual environments
are created using the software implementing operating system-level virtualization
approach, such as Virtuozzo, FreeBSD Jails, Linux-VServer, Solaris Containers, chroot
jail and OpenVZ.
Machine Aggregation
A less common use of the term is to refer to a computer cluster consisting of many
computers that have been aggregated together as a larger and more powerful “virtual”
machine. In this case, the software allows a single environment to be created spanning
multiple computers, so that the end user appears to be using only one computer rather
than several.
PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are two common
software packages that permit a heterogeneous collection of networked UNIX and/or
Windows computers to be used as a single, large, parallel computer. Thus large
computational problems can be solved more cost effectively by using the aggregate
power and memory of many computers than with a traditional supercomputer. The Plan9
Operating System from Bell Labs uses this approach.
Boston Circuits had released the gCore (grid-on-chip) Central Processing Unit (CPU)
with 16 ARC 750D cores and a Time-machine hardware module to provide a virtual
machine that uses this approach.
Self Assessment Questions
1. What is Virtual Environment?
2. Explain Machine Aggregation.
Implementation Techniques
Emulation of the underlying raw hardware (native execution)
This approach is described as full virtualization of the hardware, and can be implemented
using a Type 1 or Type 2 hypervisor. (A Type 1 hypervisor runs directly on the hardware;
a Type 2 hypervisor runs on another operating system, such as Linux.) Each virtual
machine can run any operating system supported by the underlying hardware. Users can
thus run two or more different “guest” operating systems simultaneously, in separate
“private” virtual computers.
The pioneer system using this concept was IBM’s CP-40, the first (1967) version of
IBM’s CP/CMS (1967-1972) and the precursor to IBM’s VM family (1972-present).
With the VM architecture, most users run a relatively simple interactive computing
single-user operating system, CMS, as a “guest” on top of the VM control program (VM-
CP). This approach kept the CMS design simple, as if it were running alone; the control
program quietly provides multitasking and resource management services “behind the
scenes”. In addition to CMS, VM users can run any of the other IBM operating systems,
such as MVS or z/OS. z/VM is the current version of VM, and is used to support
hundreds or thousands of virtual machines on a given mainframe. Some installations use
Linux for zSeries to run Web servers, where Linux runs as the operating system within
many virtual machines.
Full virtualization is particularly helpful in operating system development, when
experimental new code can be run at the same time as older, more stable, versions, each
in separate virtual machines. (The process can even be recursive: IBM debugged new
versions of its virtual machine operating system, VM, in a virtual machine running under
an older version of VM, and even used this technique to simulate new hardware.)
The x86 processor architecture as used in modern PCs does not actually meet the Popek
and Goldberg virtualization requirements. Notably, there is no execution mode where all
sensitive machine instructions always trap, which would allow per-instruction
virtualization.
Despite these limitations, several software packages have managed to provide
virtualization on the x86 architecture, even though dynamic recompilation of privileged
code, as first implemented by VMware, incurs some performance overhead compared to a
VM running on a natively virtualizable architecture such as the IBM System/370 or
Motorola MC68020. Several other software packages, such as Virtual PC, VirtualBox,
Parallels Workstation and Virtual Iron, have since implemented virtualization on x86
hardware.
On the other hand, plex86 can run only Linux under Linux, using a specially patched
kernel. It does not emulate a processor, but uses Bochs for emulation of motherboard
devices.
Intel and AMD have introduced features to their x86 processors to enable virtualization
in hardware.
Emulation of a non-native system
Virtual machines can also perform the role of an emulator, allowing software applications
and operating systems written for one computer processor architecture to be run on another.
Some virtual machines emulate hardware that only exists as a detailed specification. For
example:
• One of the first was the p-code machine specification, which allowed
programmers to write Pascal programs that would run on any computer running
virtual machine software that correctly implemented the specification.
• The specification of the Java virtual machine.
• The Common Language Infrastructure virtual machine at the heart of the
Microsoft .NET initiative.
• Open Firmware allows plug-in hardware to include boot-time diagnostics,
configuration code, and device drivers that will run on any kind of CPU.
This technique allows diverse computers to run any software written to that specification;
only the virtual machine software itself must be written separately for each type of
computer on which it runs.
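The idea of a machine defined purely by a specification can be illustrated with a toy interpreter. The instruction set below is invented for illustration (it is not real p-code or JVM bytecode); the point is that any host which implements the same specification can run the same "portable" program unchanged:

```python
# A toy stack-based virtual machine, in the spirit of p-code or JVM
# bytecode. The opcodes (PUSH, ADD, MUL, PRINT) are hypothetical.

def run(bytecode):
    """Interpret a list of (opcode, operand) pairs on an operand stack."""
    stack = []
    output = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return output

# Computes (2 + 3) * 4. Only run() must be rewritten per host machine;
# the bytecode itself is machine-independent.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None), ("PRINT", None)]
print(run(program))  # [20]
```

Only the interpreter (`run`) is machine-specific, mirroring the statement above that only the virtual machine software must be written separately for each type of computer.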
Self Assessment Questions
1. What are the techniques to realize Virtual Machines concept?
2. What are the advantages of Virtual Machines?
Operating system-level virtualization
Operating System-level Virtualization is a server virtualization technology which
virtualizes servers on an operating system (kernel) layer. It can be thought of as
partitioning: a single physical server is sliced into multiple small partitions (otherwise
called virtual environments (VE), virtual private servers (VPS), guests, zones etc); each
such partition looks and feels like a real server, from the point of view of its users.
The operating system level architecture has low overhead, which helps to maximize
efficient use of server resources. The virtualization introduces only a negligible overhead
and allows running hundreds of virtual private servers on a single physical server. In
contrast, approaches such as full virtualization (like VMware) and paravirtualization (like
Xen or UML) cannot achieve such a level of density, due to the overhead of running
multiple kernels. On the other hand, operating system-level virtualization does not allow
running different operating systems (i.e. different kernels), although different libraries,
distributions etc. are possible.
Self Assessment Questions
1. Describe the Operating System Level Virtualization.
Summary
The virtual machine concept has several advantages. In this environment, there is
complete protection of the various system resources. Each virtual machine is completely
isolated from all other virtual machines, so there are no protection problems. At the same
time, however, there is no direct sharing of resources. Two approaches to provide sharing
have been implemented. A virtual machine is a perfect vehicle for operating systems
research and development.
The operating system, as an extended machine, acts as an interface between the hardware
and user application programs. The kernel is the essential center of a computer operating
system, i.e. the core that provides basic services for all other parts of the operating
system. It includes the interrupt handler, scheduler, operating system address space
manager, etc.
In the layered architecture of operating systems, the components of the kernel are built
as layers on one another, and each layer can interact with its neighbour through an
interface. In micro-kernel architecture, by contrast, most of these components are not part
of the kernel but act as another layer on top of it, and the kernel comprises only the
essential and basic components.
Terminal Questions
1. Explain operating system as extended machine.
2. What is a kernel? What are the main components of a kernel?
3. Explain monolithic type of kernel architecture in brief.
4. What is a micro-kernel? Describe its architecture.
5. Compare micro-kernel with layered architecture of operating system.
6. Describe UNIX kernel components in brief.
7. What are the components of operating system?
8. Explain the responsibilities of operating system as process management.
9. Explain the function of operating system as file management.
10. What are different services provided by an operating system?
Unit 3: Process Management
This unit covers process management and threads: process creation, termination, process
states and process control, along with processes vs. threads and the types of threads.
Introduction
This unit discusses the definition of a process, process creation, process termination,
process states, and process control. It also deals with threads and thread types.
A process can be simply defined as a program in execution. A process, along with the
program code, comprises the program counter value, processor register contents, values
of variables, the stack and program data.
A process is created and terminated, and it follows some or all of the states of process
transition, such as New, Ready, Running, Waiting, and Exit.
A thread is a single sequence stream within a process. Because threads have some of the
properties of processes, they are sometimes called lightweight processes. There are two
types of threads: user level threads (ULT) and kernel level threads (KLT). User level
threads are mostly used on systems where the operating system does not support threads,
but they can also be combined with kernel level threads. Threads have properties similar
to those of processes, e.g. execution states, context switch etc.
Objectives
At the end of this unit, you will be able to understand:
• What is a Process?
• Process Creation, Process Termination
• Process States, Process Control
• Threads
• Types of Threads
What is a Process?
The notion of process is central to the understanding of operating systems. The term
process is used somewhat interchangeably with ‘task’ or ‘job’. There are quite a few
definitions presented in the literature, for instance:
• A program in execution.
• An asynchronous activity.
• The entity to which processors are assigned.
• The ‘dispatchable’ unit.
And many more, but the definition “Program in Execution” seems to be the most
frequently used, and this is the concept we will use in the present study of operating
systems.
Now that we have agreed upon the definition of a process, the question is: what is the
relation between a process and a program? Are they the same thing with different names,
or is it called a program when it is not executing and a process when it is executing?
To be precise, a process is not the same as a program. A process is more than a program
code. A process is an ‘active’ entity, as opposed to a program, which is considered a
‘passive’ entity. As we all know, a program is an algorithm expressed in some
programming language. Being passive, a program is only a part of a process. A process,
on the other hand, includes:
• Current value of the Program Counter (PC)
• Contents of the processor’s registers
• Values of the variables
• The process stack, which typically contains temporary data such as subroutine
parameters, return addresses, and temporary variables
• A data section that contains global variables
A process is the unit of work in a system.
In the process model, all software on the computer is organized into a number of sequential
processes. A process includes the PC, registers, and variables. Conceptually, each process
has its own virtual CPU. In reality, the CPU switches back and forth among processes.
Process Creation
In general-purpose systems, some way is needed to create processes as needed during
operation. There are four principal events that lead to process creation:
• System initialization.
• Execution of a process-creation system call by a running process.
• A user request to create a new process.
• Initiation of a batch job.
Foreground processes interact with users. Background processes stay in the background,
sleeping, but suddenly spring to life to handle activity such as email, web pages, printing,
and so on. Background processes are called daemons.
A process may create a new process by executing the system call ‘fork’ in UNIX; this
call creates an exact clone of the calling process. The creating process is called the parent
process and the created one is called the child process. Only one parent is needed to
create a child process. This creation of processes yields a hierarchical structure of
processes. Note that each child has only one parent, but each parent may have many
children. After the fork, the two processes, the parent and the child, initially have the
same memory image, the same environment strings and the same open files. After a
process is created, both the parent and the child have their own distinct address spaces.
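The fork-and-wait pattern can be sketched with Python's POSIX-only os module. The exit status 7 below is arbitrary, chosen just for illustration:

```python
import os

def spawn_child():
    pid = os.fork()              # clone the calling process
    if pid == 0:
        # Child: starts as an exact clone of the parent, but from here
        # on has its own distinct address space.
        os._exit(7)              # terminate with an arbitrary status
    # Parent: fork() returned the child's PID. Wait for the child and
    # collect its exit status (this also reaps the process-table entry).
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

print(spawn_child())  # 7
```

Note that both parent and child continue from the same point after fork(); the return value of fork() (0 in the child, the child's PID in the parent) is what distinguishes the two paths.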
Following are some reasons for creation of a process
1. User logs on.
2. User starts a program.
3. Operating systems creates process to provide service, e.g., to manage printer.
4. Some program starts another process.
Creation of a process involves following steps:
1. Assign a unique process identifier to the new process, followed by making new
entry in to the process table regarding this process.
2. Allocate space for the process: this operation involves finding how much space is
needed by the process and allocating space to the parts of the process, such as user
program, user data, stack and process attributes. The space requirement can be
taken by default based on the type of the process, or from the parent process if the
process is spawned by another process.
3. Initialize Process Control Block: the PCB contains various attributes required to
execute and control a process, such as process identification, processor status
information and control information. This can be initialized to standard default values
plus attributes that have been requested for this process.
4. Set the appropriate linkages: the operating system maintains various queues
related to a process in the form of linked lists, the newly created process should be
attached to one of such queues.
5. Create or expand other data structures: depending on the implementation, an
operating system may need to create some data structures for this process, for
example to maintain accounting file for billing or performance assessment.
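The creation steps above can be sketched schematically. All names here (PCB, Kernel, ready_queue) are illustrative, not a real operating system API:

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class PCB:                      # step 3: the Process Control Block
    pid: int                    # process identification
    state: str = "New"          # control information
    program_counter: int = 0    # processor status information
    registers: dict = field(default_factory=dict)
    memory: int = 0             # step 2: space allocated to the process

class Kernel:
    def __init__(self):
        self._next_pid = count(1)   # step 1: unique process identifiers
        self.process_table = {}     # step 1: entry per process
        self.ready_queue = []       # step 4: linkage to a scheduling queue

    def create_process(self, memory_needed=4096):
        pid = next(self._next_pid)              # step 1
        pcb = PCB(pid=pid, memory=memory_needed)  # steps 2-3
        self.process_table[pid] = pcb
        pcb.state = "Ready"
        self.ready_queue.append(pid)            # step 4
        return pid

kernel = Kernel()
print(kernel.create_process())   # 1
print(kernel.ready_queue)        # [1]
```

Step 5 (extra data structures such as accounting files) is omitted here, since it depends entirely on the implementation.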
Process Termination
A process terminates when it finishes executing its last statement. Its resources are
returned to the system, it is purged from any system lists or tables, and its process control
block (PCB) is erased i.e., the PCB’s memory space is returned to a free memory pool.
A process terminates, usually due to one of the following reasons:
• Normal Exit: Most processes terminate because they have done their job. This
call is exit in UNIX.
• Error Exit: When a process discovers a fatal error. For example, a user tries to
compile a program that does not exist.
• Fatal Error: An error caused by the process due to a bug in the program, for
example, executing an illegal instruction, referring to non-existent memory or
dividing by zero.
• Killed by another Process: A process executes a system call telling the
operating system to terminate some other process.
Process States
A process goes through a series of discrete process states during its lifetime. Depending
on the implementation, operating systems may differ in the number of states a process
goes through. Though there are various state models, ranging from two states to nine
states, we will first see a five-state model and then a seven-state model, as the lower-state
models are now obsolete.
Five State Process Model
Following are the states of the five-state process model. Figure 3.1 shows these state
transitions.
• New State: The process is being created.
• Terminated State: The process has finished execution.
• Blocked (waiting) State: When a process blocks, it does so because logically it
cannot continue, typically because it is waiting for input that is not yet available.
Formally, a process is said to be blocked if it is waiting for some event to happen
(such as an I/O completion) before it can proceed. In this state a process is unable
to run until some external event happens.
• Running State: A process is said to be running if it currently has the CPU, that
is, it is actually using the CPU at that particular instant.
• Ready State: A process is said to be ready if it could use a CPU if one were
available. It is runnable but temporarily stopped to let another process run.
Logically, the ‘Running’ and ‘Ready’ states are similar. In both cases the process is
willing to run, only in the case of ‘Ready’ state, there is temporarily no CPU available for
it. The ‘Blocked’ state is different from the ‘Running’ and ‘Ready’ states in that the
process cannot run, even if the CPU is available.
Following are the six possible transitions among the above-mentioned five states.
Transition 1 occurs when a process discovers that it cannot continue. If the running
process initiates an I/O operation before its allotted time expires, it voluntarily
relinquishes the CPU.
This state transition is:
Block (process): Running → Blocked.
Transition 2 occurs when the scheduler decides that the running process has run long
enough and it is time to let another process have CPU time.
This state transition is:
Time-Run-Out (process): Running → Ready.
Transition 3 occurs when all other processes have had their share and it is time for the
first process to run again.
This state transition is:
Dispatch (process): Ready → Running.
Transition 4 occurs when the external event for which a process was waiting (such as
arrival of input) happens.
This state transition is:
Wakeup (process): Blocked → Ready.
Transition 5 occurs when the process is created.
This state transition is:
Admitted (process): New → Ready.
Transition 6 occurs when the process has finished execution.
This state transition is:
Exit (process): Running → Terminated.
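The five states and six transitions above can be captured as a small transition table; any transition not listed is rejected. This is only a sketch of the model, not of a real scheduler:

```python
# The six legal transitions of the five-state model, numbered as above.
LEGAL = {
    ("Running", "Blocked"),     # 1: Block
    ("Running", "Ready"),       # 2: Time-Run-Out
    ("Ready",   "Running"),     # 3: Dispatch
    ("Blocked", "Ready"),       # 4: Wakeup
    ("New",     "Ready"),       # 5: Admitted
    ("Running", "Terminated"),  # 6: Exit
}

class Process:
    def __init__(self):
        self.state = "New"

    def move(self, new_state):
        if (self.state, new_state) not in LEGAL:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        return self.state

# A typical lifetime: admitted, dispatched, blocked on I/O, woken up,
# dispatched again, and finally exiting.
p = Process()
for s in ("Ready", "Running", "Blocked", "Ready", "Running", "Terminated"):
    p.move(s)
print(p.state)  # Terminated
```

Note that, matching the model, there is no direct New → Running or Blocked → Running edge: a process must always pass through Ready first.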
Swapping
Many operating systems follow the above process model. However, in operating systems
that do not employ virtual memory, the processor will be idle most of the time,
considering the difference between the speed of I/O and that of the processor. There will
be many processes waiting for I/O in memory, exhausting the memory. If there is no
ready process to run, new processes cannot be created, as there is no memory available to
accommodate a new process. Thus the processor has to wait till one of the waiting
processes becomes ready after completion of an I/O operation.
This problem can be solved by adding two more states to the above process model, using
the swapping technique. Swapping involves moving part or all of a process from main
memory to disk. When none of the processes in main memory is in the ready state, the
operating system swaps one of the blocked processes out onto disk, into a suspend queue.
This is a queue of existing processes that have been temporarily shifted out of main
memory, or suspended. The operating system then either creates a new process or brings
in a swapped process from the disk which has become ready.
Seven State Process Model
Figure 3.2 shows the seven-state process model, which uses the swapping technique
described above.
Apart from the transitions we have seen in the five-state model, following are the new
transitions which occur in the seven-state model.
• Blocked to Blocked / Suspend: If there are no ready processes in main
memory, at least one blocked process is swapped out to make room for another
process that is not blocked.
• Blocked / Suspend to Blocked: If a process terminates, making space in main
memory, and there is a high-priority process which is blocked and suspended,
then, anticipating that it will become unblocked very soon, that process is brought
into main memory.
• Blocked / Suspend to Ready / Suspend: A process is moved from Blocked /
Suspend to Ready / Suspend, if the event occurs on which the process was
waiting, as there is no space in the main memory.
• Ready / Suspend to Ready: If there are no ready processes in main memory,
the operating system has to bring one into main memory to continue execution.
Sometimes this transition takes place even when there are ready processes in
main memory, if they have lower priority than one of the processes in the Ready /
Suspend state; the high-priority process is then brought into main memory.
• Ready to Ready / Suspend: Normally, blocked processes are suspended by
the operating system, but sometimes, to free a large block of memory, a ready
process may be suspended. In this case, normally the low-priority processes are
suspended.
• New to Ready / Suspend: When a new process is created, it should be added to
the Ready state. But sometimes sufficient memory may not be available to
allocate to the newly created process. In this case, the new process is shifted to
Ready / Suspend.
Process Control
In this section we will study structure of a process, process control block, modes of
process execution, and process switching.
Process Structure
After studying the process states, we will now see where the process resides, and what
the physical manifestation of a process is.
The location of a process depends on the memory management scheme being used. In the
simplest case, a process is maintained in secondary memory, and to manage this process,
at least a small part of it is maintained in main memory. To execute the process, the
entire process or a part of it is brought into main memory, and for that the operating
system needs to know the location of the process.
• Process identification
• Processor state information
• Process control information
• User stack
• Private user address space (program, data)
• Shared address space
Figure 3.3: Process Image
The obvious contents of a process are the User Program to be executed and the User
Data associated with that program. Apart from these, there are two major parts of a
process: the System Stack, which is used to store parameters and calling addresses for
procedure and system calls, and the Process Control Block, which is a collection of
process attributes needed by the operating system to control a process. The collection of
user program, data, system stack, and process control block is called the Process Image,
as shown in figure 3.3 above.
Process Control Block
A process control block, as shown in figure 3.4 below, contains various attributes
required by the operating system to control a process, such as process state, program
counter, CPU state, CPU scheduling information, memory management information, I/O
state information, etc.
These attributes can be grouped into three general categories as follows:
• Process identification
• Processor state information
• Process control information
The first category stores information related to process identification, such as the
identifier of the current process; the identifier of the process which created this process,
to maintain the parent-child process relationship; and the user identifier, i.e. the identifier
of the user on whose behalf the process is being run.
The Processor state information consists of the contents of the processor registers, such
as the user-visible registers, the control and status registers, which include the program
counter and program status word, and the stack pointers.
The third category, Process Control Information, is mainly required for the control of a
process. The information includes: scheduling and state information, data structuring,
inter-process communication, process privileges, memory management, and resource
ownership and utilization.
• pointer
• process state
• process number
• program counter
• registers
• memory limits
• list of open files
• …
Figure 3.4: Process Control Block
Modes of Execution
In order to ensure the correct execution of each process, an operating system must protect
each process’s private information (executable code, data, and stack) from uncontrolled
interferences from other processes. This is accomplished by suitably restricting the
memory address space available to a process for reading/writing, so that the OS can
regain CPU control through hardware-generated exceptions whenever a process violates
those restrictions.
Also the OS code needs to execute in a privileged condition with respect to “normal”: to
manage processes, it needs to be enabled to execute operations which are forbidden to
“normal” processes. Thus most processors support at least two modes of execution.
Certain instructions can only be executed in the more privileged mode; these include
reading or altering a control register such as the program status word, primitive I/O
instructions, and memory management instructions.
The less privileged mode is referred to as user mode, as user programs are typically
executed in this mode; the more privileged mode, in which important operating system
functions are executed, is called kernel mode (also system mode or control mode).
The current mode information, i.e. whether the processor is running in user mode or
kernel mode, is stored in the PSW. The mode change is normally done by executing a
change-mode instruction, typically when a user process invokes a system call or
whenever an interrupt occurs, as these are operating system functions and need to be
executed in privileged mode. After completion of the system call or interrupt routine, the
mode is changed back to user mode to continue the user process’s execution.
Context Switching
To give each process on a multiprogrammed machine a fair share of the CPU, a hardware
clock generates interrupts periodically. This allows the operating system to schedule all
processes in main memory (using scheduling algorithm) to run on the CPU at equal
intervals. Each time a clock interrupt occurs, the interrupt handler checks how much time
the current running process has used. If it has used up its entire time slice, then the CPU
scheduling algorithm (in kernel) picks a different process to run. Each switch of the CPU
from one process to another is called a context switch.
A context is the contents of a CPU’s registers and program counter at any point in time.
Context switching can be described as the kernel (i.e., the core of the operating system)
performing the following activities with regard to processes on the CPU: (1) suspending
the progression of one process and storing the CPU’s state (i.e., the context) for that
process somewhere in memory, (2) retrieving the context of the next process from
memory and restoring it in the CPU’s registers and (3) returning to the location indicated
by the program counter (i.e., returning to the line of code at which the process was
interrupted) in order to resume the process. Figure 3.5 below depicts a context switch
from process P0 to process P1.
Figure 3.5: Process switching
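As a rough illustration of the three activities above, Python generators can stand in for processes: a generator's suspended frame plays the role of the saved context, and next() "restores" it and resumes execution where it left off. This is only an analogy, not how a kernel saves CPU registers:

```python
def process(name, steps):
    """A toy 'process' that runs one step per dispatch, then yields."""
    for i in range(steps):
        yield f"{name}:{i}"          # run one step, then be switched out

def round_robin(procs):
    """Dispatch each runnable process for one step in turn."""
    trace = []
    while procs:
        p = procs.pop(0)             # scheduler picks the next process
        try:
            trace.append(next(p))    # restore its context, run one step
            procs.append(p)          # suspend it: context saved in the
                                     # generator frame, back of the queue
        except StopIteration:
            pass                     # process terminated; drop it
    return trace

trace = round_robin([process("P0", 2), process("P1", 2)])
print(trace)  # ['P0:0', 'P1:0', 'P0:1', 'P1:1']
```

Each call to next() resumes a process exactly where it was interrupted, which is the essence of step (3) above.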
Self Assessment Questions:
1. Discuss the process state with its five state process model.
2. Explain the seven state process model.
3. What is Process Control ? Discuss the process control block.
4. Write note on Context Switching.
A context switch is sometimes described as the kernel suspending execution of one
process on the CPU and resuming execution of some other process that had previously
been suspended.
A context switch occurs due to an interrupt, a trap (an error due to the current instruction)
or a system call, as described below:
• Clock interrupt: when a process has executed its current time quantum which
was allocated to it, the process must be switched from running state to ready state,
and another process must be dispatched for execution.
• I/O interrupt: whenever any I/O-related event occurs, the OS is interrupted;
the OS has to determine the reason for it and take the necessary action for that
event. Thus the current process is switched to the ready state and the interrupt
routine is loaded to handle the interrupt event (e.g. after an I/O interrupt the OS
moves all the processes which were blocked on the event from the blocked state
to the ready state, and from blocked/suspended to ready/suspended). After
completion of the interrupt-related actions, one might expect the process which
was switched out to be brought back for execution, but that does not necessarily
happen. At this point the scheduler again decides afresh which of all the ready
processes is to be scheduled for execution. This is important, as it will schedule
any high-priority process added to the ready queue during the interrupt handling
period.
• Memory fault: when the virtual memory technique is used for memory
management, it often happens that a process refers to a memory address which is
not present in main memory and needs to be brought in. As the memory block
transfer takes time, another process should be given a chance to execute and the
current process should be blocked. Thus the OS blocks the current process, issues
an I/O request to get the memory block into memory, switches the current process
to the blocked state, and loads another process for execution.
• Trap: if the instruction being executed causes an error or exception, then
depending on the criticality of the error/exception and the design of the operating
system, it may either move the process to the exit state, or may continue executing
the current process after a possible recovery.
• System call: often a process has to invoke a system call for a privileged job; for
this, the current process is blocked and the respective operating system’s system
call code is executed. Thus the context of the current process is switched to the
system call code.
Example: UNIX Process
Let us see the example of UNIX System V, which makes use of a simple but powerful
process facility that is highly visible to the user. The following figure shows the model
followed by UNIX, in which most of the operating system executes within the
environment of a user process. Thus, two modes, user and kernel, are required. UNIX
uses two categories of processes: system processes and user processes. System processes
run in kernel mode and execute operating system code to perform administrative and
housekeeping functions, such as allocation of memory and process swapping. User
processes operate in user mode to execute user programs and utilities, and in kernel mode
to execute instructions belonging to the kernel. A user process enters kernel mode by
issuing a system call, when an exception (fault) is generated, or when an interrupt occurs.
A total of nine process states are recognized by the UNIX operating system, as explained
below:
• User Running: Executing in user mode.
• Kernel Running: Executing in kernel mode.
• Ready to Run, in Memory: Ready to run as soon as the kernel schedules it.
• Asleep in Memory: Unable to execute until an event occurs; process is in main
memory (a blocked state).
• Ready to Run, Swapped: Process is ready to run, but the swapper must swap the
process into main memory before the kernel can schedule it to execute.
• Sleeping, Swapped: The process is awaiting an event and has been swapped to
secondary storage (a blocked state).
• Preempted: Process is returning from kernel to user mode, but the kernel
preempts it and does a process switch to schedule another process.
• Created: Process is newly created and not yet ready to run.
• Zombie: Process no longer exists, but it leaves a record for its parent process to
collect.
UNIX employs two Running states to indicate whether the process is executing in user
mode or kernel mode. A distinction is made between the two states: (Ready to Run, in
Memory) and (Preempted). These are essentially the same state, as indicated by the
dotted line joining them. The distinction is made to emphasize the way in which the
preempted state is entered. When a process is running in kernel mode (as a result of a
supervisor call, clock interrupt, or I/O interrupt), there will come a time when the kernel
has completed its work and is ready to return control to the user program. At this point,
the kernel may decide to preempt the current process in favor of one that is ready and of
higher priority. In that case, the current process moves to the preempted state. However,
for purposes of dispatching, those processes in the preempted state and those in the
Ready to Run, in Memory state form one queue.
Preemption can only occur when a process is about to move from kernel mode to user
mode. While a process is running in kernel mode, it may not be preempted. This makes
UNIX unsuitable for real-time processing.
Two processes are unique in UNIX. Process 0 is a special process that is created when
the system boots; in effect, it is predefined as a data structure loaded at boot time. It is the
swapper process. In addition, process 0 spawns process 1, referred to as the init process;
all other processes in the system have process 1 as an ancestor. When a new interactive
user logs onto the system, it is process 1 that creates a user process for that user.
Subsequently, the user process can create child processes in a branching tree, so that any
particular application can consist of a number of related processes.
Threads
A thread is a single sequence stream within a process. Because threads have some of
the properties of processes, they are sometimes called lightweight processes. Within a
process, threads allow multiple streams of execution. In many respects, threads are a
popular way to improve applications through parallelism. The CPU switches rapidly
back and forth among the threads, giving the illusion that the threads are running in
parallel. Like a traditional process, i.e. a process with one thread, a thread can be in any
of several states (Running, Blocked, Ready or Terminated). Each thread has its own
stack: since threads will generally call different procedures, each has a different
execution history, and this is why a thread needs its own stack. In an operating system
that has a thread facility, the basic unit of CPU utilization is a thread. A thread has, or
consists of, a program counter (PC), a register set, and a stack space. Threads are not
independent of one another in the way processes are; as a result, a thread shares with the
other threads its code section, data section, and OS resources (collectively known as a
task), such as open files and signals.
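The sharing of the data section among threads can be demonstrated directly. In the sketch below, four threads update one shared counter, with a lock protecting the shared variable (separate processes would need interprocess communication to do the same):

```python
import threading

counter = 0                      # shared data: visible to every thread
lock = threading.Lock()

def worker(n):
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with lock:               # protect the shared variable
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # wait for all threads to terminate

print(counter)  # 40000
```

Without the lock, the final count could be anything up to 40000, since `counter += 1` is not atomic; the ease of this kind of accidental interference is revisited under the disadvantages of threads below.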
Processes Vs Threads
As we mentioned earlier, in many respects threads operate in the same way as processes.
Some of the similarities and differences are:
Similarities
• Like processes, threads share the CPU, and only one thread is running at a time.
• Like processes, threads within a process execute sequentially.
• Like processes, a thread can create children.
• And like processes, if one thread is blocked, another thread can run.
Differences
• Unlike processes, threads are not independent of one another.
• Unlike processes, all threads can access every address in the task.
• Unlike processes, threads are designed to assist one another. (Processes might
or might not assist one another, because processes may originate from different
users.)
Why Threads?
Following are some reasons why we use threads in designing operating systems:
1. A process with multiple threads makes a great server, for example a printer server.
2. Because threads can share common data, they do not need to use interprocess
communication.
3. Because of their very nature, threads can take advantage of multiprocessors.
Threads are cheap in the sense that:
1. They only need a stack and storage for registers; therefore, threads are cheap to
create.
2. Threads use very few resources of the operating system in which they are
working. That is, threads do not need a new address space, global data, program
code or operating system resources.
3. Context switching is fast when working with threads, because we only have to
save and/or restore the PC, the SP and the registers.
Advantages of Threads over Multiple Processes
• Context Switching: Threads are very inexpensive to create and destroy, and
they are inexpensive to represent. For example, they require space to store the
PC, the SP, and the general-purpose registers, but they do not require space for
memory-management information, information about open files or I/O devices in use,
etc. With so little context, it is much faster to switch between threads; in other
words, a context switch between threads is relatively cheap.
• Sharing: Threads allow the sharing of many resources that cannot be shared
between processes, for example the code section, the data section, and operating
system resources such as open files.
A proxy server satisfying the requests for a number of computers on a LAN would
benefit from a multi-threaded process. In general, any program that has to do more than
one task at a time could benefit from multithreading. For example, a program that reads
input, processes it, and writes output could have three threads, one for each task.
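The three-task program described above can be sketched in Python with the standard threading and queue modules (the function names and sample data are invented for illustration): one thread reads input, one processes it, and one writes output, with queues passing work between the stages.

```python
import queue
import threading

raw = queue.Queue()        # input -> processing
processed = queue.Queue()  # processing -> output
results = []               # stands in for the program's output

def reader():
    # Thread 1: reads input (here, a fixed sample list).
    for item in ["a", "b", "c"]:
        raw.put(item)
    raw.put(None)  # sentinel: no more input

def processor():
    # Thread 2: processes each item as it arrives.
    while True:
        item = raw.get()
        if item is None:
            processed.put(None)
            break
        processed.put(item.upper())

def writer():
    # Thread 3: writes the processed items out.
    while True:
        item = processed.get()
        if item is None:
            break
        results.append(item)

threads = [threading.Thread(target=f) for f in (reader, processor, writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # ['A', 'B', 'C']
```

Because the three threads share the queues directly, no interprocess communication mechanism is needed, which is the second advantage listed above.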
Disadvantages of Threads over Multiple Processes
• Blocking: The major disadvantage is that if the kernel is single-threaded, a system
call by one thread will block the whole process, and the CPU may be idle during the
blocking period.
• Security: Since there is extensive sharing among threads, there is a potential
security problem. It is quite possible that one thread overwrites the stack of
another thread (or damages shared data), although this is very unlikely since threads
are meant to cooperate on a single task.
Any sequential process that cannot be divided into parallel tasks will not benefit from
threads, as each thread would block until the previous one completes. For example, a
program that displays the time of day would not benefit from multiple threads.
Self Assessment Questions
1. Define a thread.
2. Discuss processes vs. threads.
3. State the advantages and disadvantages of threads over multiple processes.
Types of Threads
There are two types of threads: user level threads (ULT) and kernel level threads (KLT).
User Level Threads
User-level threads are implemented in user-level libraries, rather than via system
calls, so thread switching does not need to call the operating system or cause an
interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and
manages them as if they were single-threaded processes, as shown in figure 3.7 below.
Figure 3.7: User Level Thread
Advantages:
The most obvious advantage of this technique is that a user-level threads package can be
implemented on an operating system that does not support threads. Some other
advantages are:
• User-level threads do not require modification of the operating system.
• Simple Representation: Each thread is represented simply by a PC, registers, a stack
and a small control block, all stored in the user process address space.
• Simple Management: Creating a thread, switching between threads and
synchronizing between threads can all be done without intervention of the kernel.
• Fast and Efficient: Thread switching is not much more expensive than a procedure
call.
Disadvantages:
• There is a lack of coordination between threads and the operating system kernel.
Therefore, the process as a whole gets one time slice irrespective of whether it
has one thread or 1000 threads within it. It is up to each thread to relinquish
control to other threads.
• User-level threads require non-blocking system calls, i.e., a multithreaded kernel.
Otherwise, the entire process will block in the kernel, even if there are runnable
threads left in the process. For example, if one thread causes a page fault, the
whole process blocks.
Kernel Level Threads:
As shown in figure 3.8 below, in this method the kernel knows about and manages the
threads. No runtime system is needed in this case. Instead of a thread table in each
process, the kernel has a thread table that keeps track of all threads in the system. In
addition, the kernel also maintains the traditional process table to keep track of
processes. The operating system kernel provides system calls to create and manage
threads.
Figure 3.8: Kernel Level Thread
Advantages:
• Because the kernel has full knowledge of all threads, the scheduler may decide to
give more time to a process having a large number of threads than to a process
having a small number of threads.
• Kernel-level threads are especially good for applications that frequently block.
Disadvantages:
• Kernel-level threads are slow and inefficient. For instance, kernel thread
operations are hundreds of times slower than those of user-level threads.
• Since the kernel must manage and schedule threads as well as processes, it
requires a full thread control block (TCB) for each thread to maintain information
about threads. As a result there is significant overhead and increased kernel
complexity.
Thread States
Like processes, threads also go through some similar states, as depicted in the figure
below. The figure only shows the three main states, i.e. ready, running and blocked.
Apart from these states there are the new and terminated states, very similar to the
process states.
Figure 3.9: Thread States
The only difference between thread states and process states is that, depending on the
implementation, a running process may contain many threads, but only one of them will be
in the running state while the others are in the blocked or ready states. Thus a process
may be running while a blocked thread exists inside it. Also, with user-level threads, a
process may be blocked due to an I/O request by a thread, or may be switched to the
ready state after executing for some time, but the thread that was in the running state
at the time of the switch or I/O request will remain in the running state. Thus the
process is not in the running state, but the thread within the process is.
Self Assessment Questions
1. Write the advantages and disadvantages of user-level threads.
2. Write a note on kernel-level threads.
Summary
A process can be simply defined as a program in execution. A process, along with its
program code, comprises the program counter value, processor register contents, values
of variables, the stack and program data.
A process is created and terminated, and it follows some or all of the states of process
transition; such as New, Ready, Running, Waiting, and Exit.
A thread is a single sequence stream within a process. Because threads have some of the
properties of processes, they are sometimes called lightweight processes. There are two
types of threads: user level threads (ULT) and kernel level threads (KLT), user level
threads are mostly used on the systems where the operating system does not support
threads, but also can be combined with the kernel level threads.
Threads also have similar properties like processes e.g. execution states, context switch
etc.
Terminal Questions
1. Define process. Explain the major components of a process.
2. What are the events for process creation?
3. Explain the reasons for termination of a process.
4. Explain the process state transition with diagram.
5. Explain the event for transition of a process
1. from New to Ready
2. from Ready to Running
3. from Running to Blocked
6. What are threads?
7. State advantages and disadvantages of thread over a process.
8. What are different types of threads? Explain.
Unit 4: Memory Management
This unit covers the memory hierarchy, paging and segmentation and their policies. It
discusses cache memory, its performance, fetch and write mechanisms, and replacement
policy. It also covers associative memory.
Introduction
The part of the operating system which handles memory management is called the memory
manager. Since every process must have some amount of primary memory in order to
execute, the performance of the memory manager is crucial to the performance of the
entire system. Virtual memory refers to the technology in which some space on the hard
disk is used as an extension of main memory, so that a user program need not worry if
its size exceeds the size of the main memory.
For paging memory management, each process is associated with a page table. Each
entry in the table contains the frame number of the corresponding page in the virtual
address space of the process. This same page table is also the central data structure for
virtual memory mechanism based on paging, although more facilities are needed. It
covers the Control bits, Multi-level page table etc.
Segmentation is another popular method for both memory management and virtual
memory
Basic Cache Structure: The idea of cache memory is similar to virtual memory in that
some active portion of a low-speed memory is stored in duplicate in a higher-speed cache
memory. When a memory request is generated, the request is first presented to the cache
memory, and if the cache cannot respond, the request is then presented to main memory.
Content-Addressable Memory (CAM) is a special type of computer memory used in
certain very high speed searching applications. It is also known as associative memory,
associative storage, or associative array, although the last term is more often used for a
programming data structure.
Objectives
At the end of this unit, you will be able to understand:
• The memory hierarchy and allocation strategies
• Virtual memory and its mechanism
• Paging and segmentation
• Replacement policies and replacement algorithms
Memory Hierarchy
In addition to the responsibility of managing processes, the operating system must
efficiently manage the primary memory of the computer. The part of the operating system
which handles this responsibility is called the memory manager. Since every process
must have some amount of primary memory in order to execute, the performance of the
memory manager is crucial to the performance of the entire system. Nutt explains: “The
memory manager is responsible for allocating primary memory to processes and for
assisting the programmer in loading and storing the contents of the primary memory.
Managing the sharing of primary memory and minimizing memory access time are the
basic goals of the memory manager.”
The real challenge of efficiently managing memory is seen in the case of a system which
has multiple processes running at the same time. Since primary memory can be space-
multiplexed, the memory manager can allocate a portion of primary memory to each
process for its own use. However, the memory manager must keep track of which
processes are running in which memory locations, and it must also determine how to
allocate and de-allocate available memory when new processes are created and when old
processes complete execution. While various different strategies are used to allocate
space to processes competing for memory, three of the most popular are Best fit, Worst
fit, and First fit. Each of these strategies is described below:
• Best fit: The allocator places a process in the smallest block of unallocated
memory in which it will fit. For example, suppose a process requests 12KB of
memory and the memory manager currently has a list of unallocated blocks of
6KB, 14KB, 19KB, 11KB, and 13KB blocks. The best-fit strategy will allocate
12KB of the 13KB block to the process.
• Worst fit: The memory manager places a process in the largest block of
unallocated memory available. The idea is that this placement will create the
largest hole after the allocation, thus increasing the possibility that, compared to
best fit, another process can use the remaining space. Using the same example as
above, worst fit will allocate 12KB of the 19KB block to the process, leaving a
7KB block for future use.
• First fit: There may be many holes in the memory, so the operating system, to
reduce the amount of time it spends analyzing the available spaces, begins at the
start of primary memory and allocates memory from the first hole it encounters
large enough to satisfy the request. Using the same example as above, first fit will
allocate 12KB of the 14KB block to the process.
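The three strategies can be sketched in a few lines of Python; the function names are invented for the example, and the block list matches the 12KB request example used above.

```python
def best_fit(blocks, request):
    # Smallest unallocated block that still fits the request.
    candidates = [b for b in blocks if b >= request]
    return min(candidates) if candidates else None

def worst_fit(blocks, request):
    # Largest unallocated block, leaving the biggest usable hole behind.
    candidates = [b for b in blocks if b >= request]
    return max(candidates) if candidates else None

def first_fit(blocks, request):
    # First block (in memory order) large enough to satisfy the request.
    for b in blocks:
        if b >= request:
            return b
    return None

blocks = [6, 14, 19, 11, 13]   # unallocated block sizes in KB
print(best_fit(blocks, 12))    # 13
print(worst_fit(blocks, 12))   # 19
print(first_fit(blocks, 12))   # 14
```

The outputs match the worked example in the text: best fit chooses the 13KB block, worst fit the 19KB block, and first fit the 14KB block.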
Notice in the diagram above that the Best fit and First fit strategies both leave a tiny
segment of memory unallocated just beyond the new process. Since the amount of
memory is small, it is not likely that any new processes can be loaded here. This
condition of splitting primary memory into segments as the memory is allocated and
deallocated is known as fragmentation. The Worst fit strategy attempts to reduce the
problem of fragmentation by allocating the largest fragments to new processes. Thus, a
larger amount of space will be left as seen in the diagram above.
Another way in which the memory manager enhances the ability of the operating system
to support multiple processes running simultaneously is by the use of virtual memory.
According to Nutt, “virtual memory strategies allow a process to use the CPU when
only part of its address space is loaded in the primary memory. In this approach, each
process’s address space is partitioned into parts that can be loaded into primary memory
when they are needed and written back to secondary memory otherwise.” Another
consequence of this approach is that the system can run programs which are actually
larger than the primary memory of the system, hence the idea of “virtual memory.”
Brookshear explains how this is accomplished:
“Suppose, for example, that a main memory of 64 megabytes is required but only 32
megabytes is actually available. To create the illusion of the larger memory space, the
memory manager would divide the required space into units called pages and store the
contents of these pages in mass storage. A typical page size is no more than four
kilobytes. As different pages are actually required in main memory, the memory manager
would exchange them for pages that are no longer required, and thus the other software
units could execute as though there were actually 64 megabytes of main memory in the
machine.”
In order for this system to work, the memory manager must keep track of all the pages
that are currently loaded into the primary memory. This information is stored in a page
table maintained by the memory manager. A page fault occurs whenever a process
requests a page that is not currently loaded into primary memory. To handle page faults,
the memory manager takes the following steps:
1. The memory manager locates the missing page in secondary memory.
2. The page is loaded into primary memory, usually causing another page to be
unloaded.
3. The page table in the memory manager is adjusted to reflect the new state of the
memory.
4. The processor re-executes the instructions which caused the page fault.
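The four steps above can be sketched as a toy Python simulation. The page numbers, frame numbers and data structures are invented for illustration, and victim selection is omitted for brevity (a free frame is assumed to be available).

```python
# page_table maps virtual page number -> frame number (None if not resident)
page_table = {0: 3, 1: None, 2: 7}
free_frames = [5]

def access(page):
    """Return the frame holding `page`, handling a page fault if needed."""
    if page_table.get(page) is None:   # the requested page is not resident
        # step 1: the memory manager would locate the page in secondary memory
        frame = free_frames.pop()      # step 2: load it into a primary frame
        # (if no frame were free, some resident page would be unloaded first)
        page_table[page] = frame       # step 3: update the page table
        # step 4: the faulting instruction would then be re-executed
    return page_table[page]

print(access(1))  # page fault: page 1 is loaded into frame 5
print(access(1))  # no fault this time: the page is already resident
```

A real memory manager performs these steps in the kernel, with hardware raising the page-fault exception; the sketch only mirrors the bookkeeping.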
Virtual Memory – An Introduction
In an operating system, it is possible that a program is too large to be loaded into the
main memory. In theory, a 32-bit program may have a linear address space of up to 4
gigabytes, which is larger than the physical memory of almost all computers nowadays.
Thus we need some mechanism that allows the execution of a process that is not
completely in main memory. Overlaying is one choice. With it, the programmers have to
deal with swapping in and out themselves, to make sure that at any moment the
instruction to be executed next is physically in main memory. Obviously this places a
heavy burden on the programmers. In this unit, we introduce another solution called
virtual memory, which has been adopted by almost all modern operating systems.
Virtual memory refers to the technology in which some space on the hard disk is used as
an extension of main memory, so that a user program need not worry if its size exceeds
the size of the main memory. If that does happen, at any time only a part of the program
will reside in main memory, and the other parts will remain on the hard disk and may be
brought into memory later if needed.
This mechanism is similar to the two-level memory hierarchy we discussed before,
consisting of cache and main memory, because the principle of locality is also the basis
here. With virtual memory, if a piece of the process that is needed is not in a full
main memory, then another piece will be swapped out and the former brought in. If,
unfortunately, the swapped-out piece is needed immediately, it will have to be loaded
back into main memory right away. As we know, access to the hard disk is time-consuming
compared to access to main memory; thus references to the virtual memory space on the
hard disk could deteriorate system performance significantly. Fortunately, the principle
of locality holds: the instruction and data references during a short period tend to be
bounded to one piece of the process, so accesses to the hard disk will not be frequently
requested and performed. Thus the same principle, on the one hand, enables the caching
mechanism to increase system performance and, on the other hand, avoids the
deterioration of performance with virtual memory. With virtual memory, there must be
some facility to separate a process into several pieces so that they may reside
separately either on the hard disk or in main memory. Paging and/or segmentation are two
methods that are usually used to achieve this goal.
Paging
For paging memory management, each process is associated with a page table. Each
entry in the table contains the frame number of the corresponding page in the virtual
address space of the process. This same page table is also the central data structure for
virtual memory mechanism based on paging, although more facilities are needed.
Control bits
Since only some pages of a process may be in main memory, a bit in the page table entry,
P in Figure 1(a), is used to indicate whether the corresponding page is present in main
memory or not. Another control bit needed in the page table entry is a modified bit, M,
indicating whether the contents of the corresponding page have been altered or not since
the page was last loaded into main memory. We often speak of swapping in and swapping
out, suggesting that a process is typically separated into two parts, one residing in
main memory and the other in secondary memory, with pages moving from one part to the
other. Together they make up the whole process image. Actually the secondary memory
contains the whole image of the process, part of which may have been loaded into main
memory. When swapping out is to be performed, typically the page to be swapped out may
simply be overwritten by the new page, since a copy of that page is already in secondary
memory. However, sometimes the contents of a page may have been altered at runtime, say
a page containing data. In this case, the alteration should be reflected in secondary
memory, so when the M bit is 1, the page to be swapped out should be written out. Other
bits may also be used for sharing or protection.
Multi-level page table
Typically, there is only one page table for each process, which is completely loaded
into main memory during the execution of the process. However, some processes may be so
large that even the page table cannot be held fully in main memory. For example, in the
32-bit x86 architecture, each process may have up to 2^32 = 4G bytes of virtual memory.
With pages of 2^9 = 512 bytes, as many as 2^23 pages are needed, as well as a page table
of 2^23 entries. If each entry requires 4 bytes, that will be 2^25 bytes = 32 Mbytes.
Thus some mechanism is needed to allow only part of a page table to be loaded in main
memory. Naturally, we use paging for this. That is, page tables are subject to paging
just as other pages are; this is called multi-level paging. Figure 2 shows an example of
a two-level scheme with a 32-bit address. If we assume 4-Kbyte pages, then the 4G-byte
virtual address space is composed of 2^20 pages. If each page table entry requires 4
bytes, then a user page table of 2^20 entries requires 4 Mbytes. This huge page table
itself needs 2^10 pages. For paging it, a root page table of 2^10 entries is needed,
requiring 4 Kbytes.
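The arithmetic of the 4-Kbyte-page example can be checked with a few lines of Python:

```python
address_bits = 32
page_size = 4 * 1024                    # 4-Kbyte pages
entry_size = 4                          # bytes per page-table entry

pages = 2**address_bits // page_size    # pages in the virtual address space
table_bytes = pages * entry_size        # size of the user page table
table_pages = table_bytes // page_size  # pages needed to hold that table itself
root_bytes = table_pages * entry_size   # size of the root page table

print(pages)        # 1048576  (2^20 pages)
print(table_bytes)  # 4194304  (4 Mbytes)
print(table_pages)  # 1024     (2^10 pages)
print(root_bytes)   # 4096     (4 Kbytes)
```

The root table is small enough (4 Kbytes, exactly one page) to stay resident in main memory at all times, which is what makes the two-level scheme workable.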
Fig. 1: Typical memory management formats
With this two-level paging scheme, the root page table always remains in main memory.
The first 10 bits of a virtual address are used to index into the root page table to find an
entry for a page of the user page table. If that page is not in main memory, a page fault
occurs and the operating system is asked to load that page. If it is in main memory, then
the next 10 bits of the virtual address index into the user page table to find the entry for
the page that is referenced by the virtual address. This whole process is illustrated in
Figure 3.
Fig. 2: A two-level hierarchical page table
Fig. 3: Address translation in a two-level paging system
Translation lookaside buffer
As we discussed before, a translation lookaside buffer (TLB) may be used to speed up
paging and avoid frequent access to main memory, which is shown in Figure 4. With
multi-level paging scheme, the benefit of TLB will be even more significant.
Fig. 4: Use of a translation lookaside buffer
It should be noted that the TLB is a cache for a page table while the regular cache we
mentioned before is for main memory and these facilities should work together when
they are both present in a system. As figure 5 illustrates, for a virtual address consisting
of a page number and an offset address, the memory system consults the TLB first to see
if the matching page entry is present. If yes, the real address is generated by combining
the frame number with the offset. If not, the entry is accessed from a page table. Once the
real address is generated, the cache is consulted to see if the block containing that word is
present. If so, it is returned to the CPU. If not, the word is retrieved from main memory.
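The TLB-then-page-table part of this lookup can be sketched as a toy Python model. A real TLB is hardware; the dictionaries, page size and frame numbers here are invented for illustration.

```python
tlb = {}                     # page number -> frame number (small, fast cache)
page_table = {0: 8, 1: 2}    # the full translation, normally in main memory
PAGE_SIZE = 4096

def translate(vaddr):
    """Translate a virtual address to a real (physical) address."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page in tlb:                    # TLB hit: no page-table access needed
        frame = tlb[page]
    else:                              # TLB miss: consult the page table
        frame = page_table[page]
        tlb[page] = frame              # remember the translation for next time
    return frame * PAGE_SIZE + offset  # combine frame number with the offset

print(translate(4100))  # page 1, offset 4 -> frame 2 -> real address 8196
print(translate(4101))  # same page again: this time the TLB supplies the frame
```

In a real system the physical address produced here would then be presented to the cache, and only on a cache miss would main memory be accessed, exactly as described above.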
Self Assessment Questions
1. Discuss the page table with suitable example.
2. Explain the significant of control bits in paging mechanism.
3. What strategy would you follow in paging if a process demands such a large
memory space that its page table cannot be held in memory?
Cleaning policy
A cleaning policy is the opposite of a fetch policy. It deals with when a modified page
should be written out to secondary memory. There are two common choices:
• Demand cleaning: A page is written out only when it has been selected for
replacement.
• Pre-cleaning: Modified pages are updated in secondary memory before their
page frames are needed, so that pages can be written out in batches.
Pre-cleaning has an advantage over demand cleaning, but it cannot be performed too
frequently, because some pages may be modified so often that writing them out repeatedly
turns out to be unnecessary.
Frame locking
One point that is worth mentioning is that some of the frames in main memory may not
be replaced, or may be locked. For example, the frames occupied by the kernel of the
operating system, used for I/O buffers and other time-critical areas should always be
available in main memory for the operating system to operate properly. This requirement
can be satisfied by adding an additional bit in the page table.
Load control
Another related question is how many processes may be started to run and reside in main
memory simultaneously, which is called load control. Load control is critical in memory
management because, if too few processes are in main memory at any one time, it will be
very likely for all the processes to be blocked, and thus much time will be spent in
swapping. On the other hand, if too many processes exist, each individual process will be
allocated a small number of frames, and thus frequent page faulting will occur. Figure
10 shows that, with all other aspects fixed, there is a specific multiprogramming level
that achieves the highest utilization.
Fig. 10: Multiprogramming effects
Cache Memory
Basic Cache Structure
Processors are generally able to perform operations on operands faster than the access
time of large capacity main memory. Though semiconductor memory which can operate
at speeds comparable with the operation of the processor exists, it is not economical to
provide all the main memory with very high speed semiconductor memory. The problem
can be alleviated by introducing a small block of high speed memory called a cache
between the main memory and the processor.
The idea of cache memory is similar to virtual memory in that some active portion of a
low-speed memory is stored in duplicate in a higher-speed cache memory. When a
memory request is generated, the request is first presented to the cache memory, and if
the cache cannot respond, the request is then presented to main memory.
The difference between cache and virtual memory is a matter of implementation; the two
notions are conceptually the same because they both rely on the correlation properties
observed in sequences of address references. Cache implementations are totally different
from virtual memory implementation because of the speed requirements of cache.
We define a cache miss to be a reference to an item that is not resident in cache, but
is resident in main memory. The corresponding concept for virtual memory is the page
fault, which is defined to be a reference to a page that is not resident in main memory.
For cache misses, the fast memory is the cache and the slow memory is main memory. For
page faults, the fast memory is main memory and the slow memory is auxiliary memory.
Fig. 11: A cache-memory reference. The tag 0117X matches address 01173, so the cache returns the
item in the position X=3 of the matched block
An address is presented to the cache. The cache searches its directory of address tags,
shown in the figure, to see if the item is in the cache. If the item is not in the
cache, a miss occurs.
For READ operations that cause a cache miss, the item is retrieved from main memory and
copied into the cache. During the short period available before the main-memory
operation is complete, some other item in the cache is removed from the cache to make
room for the new item.
The cache-replacement decision is critical; a good replacement algorithm can yield
somewhat higher performance than a bad replacement algorithm. The effective cycle time
of a cache memory (t_eff) is the average of the cache-memory cycle time (t_cache) and
the main-memory cycle time (t_main), where the probabilities in the averaging process
are the probabilities of hits and misses.
If we consider only READ operations, then a formula for the average cycle time is:
t_eff = t_cache + (1 – h) × t_main
where h is the probability of a cache hit (sometimes called the hit rate); the quantity
(1 – h), which is the probability of a miss, is known as the miss rate.
In Fig.11 we show an item in the cache surrounded by nearby items, all of which are
moved into and out of the cache together. We call such a group of data a block of the
cache.
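As a quick worked example of the effective-cycle-time formula (the cycle times and hit rate below are invented for illustration):

```python
def effective_cycle_time(t_cache, t_main, hit_rate):
    # t_eff = t_cache + (1 - h) * t_main : every access pays the cache time,
    # and a miss (probability 1 - h) additionally pays the main-memory time.
    return t_cache + (1 - hit_rate) * t_main

# e.g. a 10 ns cache, 100 ns main memory, and a 95% hit rate
# give an effective cycle time of about 15 ns.
print(effective_cycle_time(t_cache=10, t_main=100, hit_rate=0.95))
```

Note how strongly the hit rate dominates: dropping h from 0.95 to 0.90 doubles the miss penalty term and raises the effective cycle time from about 15 ns to about 20 ns.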
Cache Memory Organizations
Fig. 12: The logical organization of a four-way set-associative cache
Fig. 12 shows a conceptual implementation of a cache memory. This system is called set
associative because the cache is partitioned into distinct sets of blocks, and each set
contains a small fixed number of blocks. The sets are represented by the rows in the
figure. In this case, the cache has N sets, and each set contains four blocks. When an
access occurs to this cache, the cache controller does not search the entire cache
looking for a match. Instead, the controller maps the address to a particular set of the
cache and searches only that set for a match.
If the block is in the cache, it is guaranteed to be in the set that is searched. Hence, if the
block is not in that set, the block is not present in the cache, and the cache controller
searches no further. Because the search is conducted over four blocks, the cache is said to
be four-way set associative or, equivalently, to have an associativity of four.
Fig. 12 is only one example; there are various ways that a cache can be arranged
internally to store the cached data. In all cases, the processor references the cache
with the main-memory address of the data it wants. Hence each cache organization must
use this address to find the data in the cache if it is stored there, or to indicate to
the processor when a miss has occurred. The problem of mapping the information held in
main memory into the cache must be totally implemented in hardware to achieve
improvements in system operation. Various strategies are possible.
Fully associative mapping
Perhaps the most obvious way of relating cached data to the main-memory address is to
store both the memory address and the data together in the cache. This is the fully
associative mapping approach. A fully associative cache requires the cache to be
composed of associative memory holding both the memory address and the data for each
cached line. The incoming memory address is simultaneously compared with all stored
addresses using the internal logic of the associative memory, as shown in Fig. 13. If a
match is found, the corresponding data is read out. Single words from anywhere within
the main memory can be held in the cache, if the associative part of the cache is
capable of holding a full address.
Fig. 13: Cache with fully associative mapping
In all organizations, the data can be more than one word, i.e., a block of consecutive
locations, to take advantage of spatial locality. In Fig. 14 a line constitutes four
words, each word being 4 bytes. The least significant part of the address selects the
particular byte, the next part selects the word, and the remaining bits form the address
compared to the address in the cache. The whole line can be transferred to and from the
cache in one transaction if there are sufficient data paths between the main memory and
the cache. With only one data-word path, the words of the line have to be transferred in
separate transactions.
Fig. 14: Fully associative mapped cache with multi-word lines
The fully associative mapped cache gives the greatest flexibility in holding
combinations of blocks in the cache and the minimum conflict for a given sized cache,
but it is also the most expensive, due to the cost of the associative memory. It
requires a replacement algorithm to select a block to remove upon a miss, and the
algorithm must be implemented in hardware to maintain a high speed of operation. The
fully associative cache can only be formed economically with a moderate capacity.
Microprocessors with small internal caches often employ the fully associative mechanism.
Direct mapping
The fully associative cache is expensive to implement because it requires a comparator
for each cache location, effectively a special type of memory. In direct mapping, the
cache consists of normal high-speed random access memory, and each location in the
cache holds the data, at an address in the cache given by the lower significant bits of
the main-memory address. This enables the block to be selected directly from the lower
significant bits of the memory address. The remaining higher significant bits of the
address are stored in the cache with the data to complete the identification of the
cached data.
Consider the example shown in Fig. 15. The address from the processor is divided into
two fields, a tag and an index. The tag consists of the higher significant bits of the
address, which are stored with the data. The index is the lower significant bits of the
address, used to address the cache.
Figure 15: Direct Mapping
When the memory is referenced, the index is first used to access a word in the cache.
Then the tag stored in the accessed word is read and compared with the tag in the address.
If the two tags are the same, indicating that the word is the one required, access is made
to the addressed cache word. However, if the tags are not the same, indicating that the
required word is not in the cache, reference is made to the main memory to find it. For a
memory read operation, the word is then transferred into the cache where it is accessed. It
is possible to pass the information to the cache and the processor simultaneously, i.e., to
read-through the cache, on a miss. The cache location is altered for a write operation. The
main memory may be altered at the same time (write-through) or later.
Fig. 15. shows the direct mapped cache with a line consisting of more than one word. The
main memory address is composed of a tag, an index, and a word within a line. All the
words within a line in the cache have the same stored tag. The index part of the address is
used to access the cache and the stored tag is compared with the required tag address. For a
read operation, if the tags are the same the word within the block is selected for transfer
to the processor. If the tags are not the same, the block containing the required word is
first transferred to the cache.
In direct mapping, the corresponding blocks with the same index in the main memory
will map into the same block in the cache, and hence only blocks with different indices
can be in the cache at the same time. A replacement algorithm is unnecessary, since there
is only one allowable location for each incoming block. Efficient replacement relies on
the low probability of lines with the same index being required. However there are such
occurrences, for example, when two data vectors are stored starting at the same index and
pairs of elements need to be processed together. To gain the greatest performance, data
arrays and vectors need to be stored in a manner which minimizes the conflicts in
processing pairs of elements. Fig. 15 shows the lower bits of the processor address used to
address the cache location directly. It is possible to introduce a mapping function between
the address index and the cache index so that they are not the same.
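The direct-mapped address split and lookup described above can be sketched in code. The field widths below (4 offset bits for a 16-word line, 6 index bits for 64 cache lines) are illustrative assumptions, not values from the text, and only the stored tags are modelled:

```python
# Direct-mapped cache lookup sketch. Assumed layout: 16-word lines -> 4
# offset bits, 64 cache lines -> 6 index bits, remaining bits form the tag.

OFFSET_BITS = 4
INDEX_BITS = 6

def split_address(addr):
    """Split an address into (tag, index, word offset within the line)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        # Each line holds only its stored tag; the cached data is omitted.
        self.tags = [None] * (1 << INDEX_BITS)

    def access(self, addr):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag  # block fetched from main memory replaces the line
        return False
```

Two addresses with the same index but different tags contend for the same line, which is exactly the conflict situation described above.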
II Set-associative mapping
In the direct scheme, all words stored in the cache must have different indices.
The tags may be the same or different. In the fully associative scheme, a block
can displace any other block and can be placed anywhere, but fully associative
memories are costly and operate relatively slowly.
Set-associative mapping allows a limited number of blocks, with the same index
and different tags, in the cache and can therefore be considered as a compromise
between a fully associative cache and a direct mapped cache. The cache is divided
into “sets” of blocks. A four-way set associative cache would have four blocks in
each set. The number of blocks in a set is known as the associativity or set size.
Each block in each set has a stored tag which, together with the index, completes
the identification of the block. First, the index of the address from the processor is
used to access the set. Then, comparators are used to compare all tags of the
selected set with the incoming tag. If a match is found, the corresponding location
is accessed, otherwise, as before, an access to the main memory is made.
Figure 16: Cache with set-associative mapping
The tag address bits are always chosen to be the most significant bits of the full
address, the block address bits are the next significant bits and the word/byte
address bits form the least significant bits as this spreads out consecutive main
memory blocks throughout consecutive sets in the cache. This addressing format
is known as bit selection and is used by all known systems. In a set-associative
cache it would be possible to have the set address bits as the most significant bits
of the address and the block address bits as the next significant, with the word
within the block as the least significant bits, or with the block address bits as the
least significant bits and the word within the block as the middle bits.
Notice that the comparison between the stored tags and the incoming tag is done
using comparators, which can be shared across sets for each associative search, and
all the information, tags and data, can be stored in ordinary random access memory. The
number of comparators required in the set-associative cache is given by the
number of blocks in a set, not the number of blocks in all, as in a fully associative
memory. The set can be selected quickly and all the blocks of the set can be read
out simultaneously with the tags before waiting for the tag comparisons to be
made. After a tag has been identified, the corresponding block can be selected.
The replacement algorithm for set-associative mapping need only consider the
lines in one set, as the choice of set is predetermined by the index in the address.
Hence, with two blocks in each set, for example, only one additional bit is
necessary in each set to identify the block to replace.
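As a sketch of the mechanism (the 4-bit index and the single replacement bit per set are assumptions for illustration), a two-way set-associative lookup might look like:

```python
# Two-way set-associative cache sketch: each set holds two stored tags and
# one bit identifying the block to replace, as described above.

INDEX_BITS = 4              # 16 sets, an assumed size for illustration
NUM_SETS = 1 << INDEX_BITS

class TwoWaySetAssocCache:
    def __init__(self):
        self.tags = [[None, None] for _ in range(NUM_SETS)]
        self.next_victim = [0] * NUM_SETS  # one replacement bit per set

    def access(self, addr):
        """Return True on a hit; on a miss, replace the set's LRU block."""
        index = addr & (NUM_SETS - 1)
        tag = addr >> INDEX_BITS
        ways = self.tags[index]
        for w in (0, 1):
            if ways[w] == tag:            # both comparators work in parallel
                self.next_victim[index] = 1 - w
                return True
        victim = self.next_victim[index]  # miss: evict least recently used way
        ways[victim] = tag
        self.next_victim[index] = 1 - victim
        return False
```

Note that, unlike direct mapping, two blocks with the same index but different tags can reside in the cache at the same time.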
III Sector Mapping
In sector mapping, the main memory and the cache are both divided into sectors;
each sector is composed of a number of blocks. Any sector in the main memory
can map into any sector in the cache and a tag is stored with each sector in the
cache to identify the main memory sector address. However, a complete sector is
not transferred to the cache or back to the main memory as one unit. Instead,
individual blocks are transferred as required. On a cache sector miss, the required
block of the sector is transferred into a specific location within one sector. The
sector location in the cache is selected and, until they are replaced, all the other
blocks in that cache sector still hold data from a previous sector.
Sector mapping might be regarded as a fully associative mapping scheme with
valid bits, as in some microprocessor caches. Each block in the fully associative
mapped cache corresponds to a sector, and each byte corresponds to a “sector
block”.
Self Assessment Questions
1. Discuss the basic ideas of using the cache memory.
2. Write a note on:
a. Cache Memory Organization b. Direct Mapping
3. Explain the cache with set-associative mapping with neat diagram.
Cache Performance
The performance of a cache can be quantified in terms of the hit and miss rates, the cost
of a hit, and the miss penalty, where a cache hit is a memory access that finds data in the
cache and a cache miss is one that does not.
When reading, the cost of a cache hit is roughly the time to access an entry in the cache.
The miss penalty is the additional cost of replacing a cache line with one containing the
desired data.
(Access time) = (hit cost) + (miss rate)*(miss penalty)
= (Fast memory access time) + (miss rate)*(slow memory access time)
Note that the approximation is an underestimate – control costs have been left out. Also
note that only one word is being loaded from the faster memory while a whole cache
block’s worth of data is being loaded from the slower memory.
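Plugging in illustrative numbers (a 2 ns hit cost, a 50 ns miss penalty and a 5% miss rate; these are assumed values, not from the text), the approximation gives:

```python
# Average access time = (hit cost) + (miss rate) * (miss penalty)
hit_cost = 2e-9       # fast (cache) access time, assumed
miss_penalty = 50e-9  # slow (main memory) access time, assumed
miss_rate = 0.05      # assumed

access_time = hit_cost + miss_rate * miss_penalty
# 2 ns + 0.05 * 50 ns = 4.5 ns average
```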
Since the speeds of the actual memory used will be improving “independently”, most
effort in cache design is spent on fast control and decreasing the miss rates. We can
classify misses into three categories, compulsory misses, capacity misses and conflict
misses. Compulsory misses are when data is loaded into the cache for the first time (e.g.
program startup) and are unavoidable. Capacity misses are when data is reloaded because
the cache is not large enough to hold all the data no matter how we organize the data (i.e.
even if we changed the hash function and made it omniscient). All other misses are
conflict misses – there is theoretically enough space in the cache to avoid the miss but our
fast hash function caused a miss anyway.
Fetch and write mechanism
Fetch policy
We can identify three strategies for fetching bytes or blocks from the main memory to the
cache, namely:
1. Demand fetch
Which is fetching a block when it is needed and is not already in the cache,
i.e. to fetch the required block on a miss. This strategy is the simplest and requires
no additional hardware or tags in the cache recording the references, except to
identify the block in the cache to be replaced.
2. Pre-fetch
Which is fetching blocks before they are requested. A simple prefetch strategy is
to prefetch the (i+1)th block when the ith block is initially referenced on the
expectation that it is likely to be needed if the ith block is needed. On the simple
prefetch strategy, not all first references will induce a miss, as some will be to
prefetched blocks.
3. Selective fetch
Which is the policy of not always fetching blocks, dependent upon some defined
criterion, and in these cases using the main memory rather than the cache to hold
the information. For example, shared writable data might be easier to maintain if
it is always kept in the main memory and not passed to a cache for access,
especially in multi-processor systems. Cache systems need to be designed so that
the processor can access the main memory directly and bypass the cache.
Individual locations could be tagged as non-cacheable.
Instruction and data caches
The basic stored program computer provides for one main memory for holding
both program instructions and program data. The cache can be organized in the
same fashion, with the cache holding both program instructions and data. This is
called a unified cache. We also can separate the cache into two parts: data cache
and instruction (code) cache. The general arrangement of separate caches is
shown in fig. 17. Often the cache will be integrated inside the processor chip.
Figure 17: Separate instruction and data caches
Write operations
As reading the required word in the cache does not affect the cache contents, there
can be no discrepancy between the cache word and the copy held in the main
memory after a memory read instruction. However, in general, writing can occur
to cache words and it is possible that the cache word and copy held in the main
memory may be different. It is necessary to keep the cache and the main memory
copy identical if input/output transfers operate on the main memory contents, or if
multiple processors operate on the main memory, as in a shared memory multiple
processor system.
If we ignore the overhead of maintaining consistency and the time for writing data
back to the main memory, then the average access time is given by the previous
equation, i.e. teff = tcache + ( 1 – h ) tmain , assuming that all accesses are first made
to the cache. The average access time including write operations will add
additional time to this equation that will depend upon the mechanism used to
maintain data consistency. There are two principal alternative mechanisms to
update the main memory, namely the write-through mechanism and the write-
back mechanism.
Write-through mechanism
In the write-through mechanism, every write operation to the cache is repeated to
the main memory, normally at the same time. The additional write operation to
the main memory will, of course, take much longer than to the cache and will
dominate the access time for write operations. The average access time of write-
through with transfers from main memory to the cache on all misses (read and
write) is given by:
ta = tcache + ( 1 – h ) ttrans + w(tmain – tcache)
= (1 – w) tcache + (1 – h) ttrans + w tmain
where
ttrans = time to transfer a block to the cache, assuming the whole block must be
transferred together
w = fraction of write references.
The term (tmain - tcache) is the additional time to write the word to main memory
whether a hit or a miss has occurred, given that both cache and main memory
write operation occur simultaneously but the main memory write operation must
complete before any subsequent cache read/write operation can proceed. If the
size of the block matches the external data path size, a whole block can be
transferred in one transaction and
ttrans = tmain.
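As a numerical illustration of the write-through expression (the timings and ratios below are assumed values chosen only for illustration, with ttrans = tmain since the block matches the data path):

```python
# Write-through average access time:
# ta = (1 - w) tcache + (1 - h) ttrans + w tmain
t_cache = 10.0    # ns, assumed cache access time
t_main = 100.0    # ns, assumed main memory access time
t_trans = t_main  # block size matches the external data path
h = 0.95          # hit ratio, assumed
w = 0.2           # fraction of write references, assumed

t_a = (1 - w) * t_cache + (1 - h) * t_trans + w * t_main
# 8 ns + 5 ns + 20 ns = 33 ns
```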
On a cache miss, a block could be transferred from the main memory to the cache
whether the miss was caused by a write or by a read operation. The term allocate
on write is used to describe a policy of bringing a word/block from the main
memory into the cache for a write operation. In write-through, fetch on write
transfers are often not done on a miss, i.e., a non-allocate on write policy. The
information will be written back to the main memory but not kept in the cache.
The write-through scheme can be enhanced by incorporating buffers, as shown in
Fig. 18, to hold information to be written back to the main memory, freeing the
cache for subsequent accesses.
Figure 18: Cache with write buffer
For write-through, each item to be written back to the main memory is held in a
buffer together with the corresponding main memory address if the transfer
cannot be made immediately. Immediate writing to main memory when new
values are generated ensures that the most recent values are held in the main
memory and hence that any device or processor accessing the main memory
should obtain the most recent values immediately, thus avoiding the need for
complicated consistency mechanisms. There will be latency before the main
memory has been updated, and the cache and main memory values are not
consistent during this period.
2. Write-back mechanism
In the write-back mechanism, the write operation to the main memory is only
done at block replacement time. At this time, the block displaced by the incoming
block might be written back to the main memory irrespective of whether the block
has been altered. The policy is known as simple write-back, and leads to an
average access time of:
ta = tcache + ( 1 – h ) ttrans + (1 – h) ttrans
Where one (1 – h) ttrans term is due to fetching a block from memory and the other
(1 – h) ttrans term is due to writing back a block. Write-back normally handles
write misses as allocate on write, as opposed to write-through, which often
handles write misses as non-allocate on write.
The write-back mechanism usually only writes back lines that have been altered.
To implement this policy, a 1-bit tag is associated with each cache line and is set
whenever the block is altered. At replacement time, the tags are examined to
determine whether it is necessary to write the block back to the main memory.
The average access time now becomes:
ta = tcache + ( 1 – h ) ttrans + wb(1 – h) ttrans
where wb is the probability that a block has been altered (fraction of blocks
altered). The probability that a block has been altered could be as high as the
probability of write references, w, but is likely to be much less, as more than one
write reference to the same block is likely and some references to the same
byte/word within the block are likely. However, under this policy the complete
block is written back, even if only one word in the block has been altered, and
thus the policy results in more traffic than is necessary, especially for memory
data paths narrower than a line, but still there is usually less memory traffic than
write-through, which causes every alteration to be recorded in the main memory.
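As a numerical illustration of the write-back expression with the altered-block (dirty) tag (the timings and probabilities below are assumed values, not from the text):

```python
# Write-back average access time with dirty-bit tagging:
# ta = tcache + (1 - h) ttrans + wb (1 - h) ttrans
t_cache = 10.0   # ns, assumed cache access time
t_trans = 100.0  # ns, assumed block transfer time
h = 0.95         # hit ratio, assumed
wb = 0.5         # probability a replaced block has been altered, assumed

t_a = t_cache + (1 - h) * t_trans + wb * (1 - h) * t_trans
# 10 ns + 5 ns + 2.5 ns = 17.5 ns
```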
The write-back scheme can also be enhanced by incorporating buffers to hold
information to be written back to the main memory, just as is possible and
normally done with write-through.
Self Assessment Questions
1. List and explain the various activities involved in fetch and write
mechanism.
2. When is the write-back mechanism used, and what is its average access time?
Replacement policy
When the required word of a block is not held in the cache, we have seen that it is
necessary to transfer the block from the main memory into the cache, displacing an
existing block if the cache is full. Except for direct mapping, which does not allow a
replacement algorithm, the existing block in the cache is chosen by a replacement
algorithm. The replacement mechanism must be implemented totally in hardware,
preferably such that the selection can be made completely during the main memory cycle
for fetching the new block. Ideally, the block replaced will not be needed again in the
future. However, such future events cannot be known and a decision has to be made
based upon facts that are known at the time.
1. Random replacement algorithm
Perhaps the easiest replacement algorithm to implement is a pseudo-random
replacement algorithm. A true random replacement algorithm would select a
block to replace in a totally random order, with no regard to memory references or
previous selections; practical random replacement algorithms can approximate
this algorithm in one of several ways. For example, one counter for the whole
cache could be incremented at intervals (for example after each clock cycle, or
after each reference, irrespective of whether it is a hit or a miss). The value held in
the counter identifies the block in the cache (if fully associative) or the block in
the set if it is a set-associative cache. The counter should have sufficient bits to
identify any block. For a fully associative cache, an n-bit counter is necessary if
there are 2^n blocks in the cache. For a four-way set-associative cache, one 2-bit
counter would be sufficient, together with logic to increment the counter.
2. First-in first-out replacement algorithm
The first-in first-out replacement algorithm removes the block that has been in the
cache for the longest time. The first-in first-out algorithm would naturally be
implemented with a first-in first-out queue of block addresses, but can be more
easily implemented with counters, only one counter for a fully associative cache
or one counter for each set in a set-associative cache, each with a sufficient
number of bits to identify the block.
3. Least recently used algorithm for a cache
In the least recently used (LRU) algorithm, the block which has not been
referenced for the longest time is removed from the cache. Only those blocks in
the cache are considered. The word “recently” is used because the block chosen is
not the least used overall; the least used blocks are likely to be back in main memory.
It is the least used of those blocks in the cache, and all of those are likely to have
been used recently, otherwise they would not be in the cache. The least recently used (LRU)
algorithm is popular for cache systems and can be implemented fully when the
number of blocks involved is small. There are several ways the algorithm can be
implemented in hardware for a cache, these include:
1) Counters
In the counter implementation, a counter is associated with each block. A simple
implementation would be to increment each counter at regular intervals and to
reset a counter when the associated line had been referenced. Hence the value in
each counter would indicate the age of a block since last referenced. The block
with the largest age would be replaced at replacement time.
2) Register stack
In the register stack implementation, a set of n-bit registers is formed, one for
each block in the set to be considered. The most recently used block is recorded at
the “top” of the stack and the least recently used block at the bottom. Actually, the
set of registers does not form a conventional stack, as both ends and internal
values are accessible. The value held in one register is passed to the next register
under certain conditions. When a block is referenced, starting at the top of the
stack, the values held in the registers are shifted
one place towards the bottom of the stack until a register is found to hold the same
value as the incoming block identification. Subsequent registers are not shifted.
The top register is loaded with the incoming block identification. This has the
effect of moving the contents of the register holding the incoming block number
to the top of the stack. This logic is fairly substantial and slow, and not really a
practical solution.
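The register-stack behaviour can be mimicked with an ordinary list (a software sketch, not the hardware shift logic itself):

```python
def touch(stack, block):
    """Reference `block`: move it to the top (most recently used) position.
    Entries above the referenced block shift down one place, as in the
    register stack; entries below it are untouched. On a miss, the bottom
    (least recently used) entry is evicted."""
    if block in stack:
        stack.remove(block)
    else:
        stack.pop()  # evict the LRU block at the bottom
    stack.insert(0, block)

recency = [3, 1, 0, 2]  # top (MRU) ... bottom (LRU)
touch(recency, 0)       # hit on block 0: it moves to the top
touch(recency, 5)       # miss: block 2 (the LRU) is evicted
```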
Fig. 19
3) Reference matrix
The reference matrix method centers around a matrix of status bits. There is more
than one version of the method. In one version (Smith, 1982), the upper triangular
matrix of a B × B matrix is formed without the diagonal, if there are B blocks to
consider. The triangular matrix has (B * (B – 1))/2 bits. When the ith block is
referenced, all the bits in the ith row of the matrix are set to 1 and then all the bits
in the ith column are set to 0. The least recently used block is one which has all
0’s in its row and all 1’s in its column, which can be detected easily by logic. The
method is demonstrated in Fig. 19 for
B = 4 and the reference sequence 2, 1, 3, 0, 3, 2, 1, …, together with the values
that would be obtained using a register stack.
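The reference-matrix update rule can be sketched as follows. For simplicity this sketch stores the full B × B bit matrix rather than only the upper triangle (the triangular version keeps the same information in (B * (B – 1))/2 bits):

```python
class ReferenceMatrixLRU:
    """Reference-matrix LRU tracking (full-matrix formulation)."""
    def __init__(self, b):
        self.b = b
        self.m = [[0] * b for _ in range(b)]  # B x B matrix of status bits

    def reference(self, i):
        # When block i is referenced: set row i to all 1s, then clear column i.
        for j in range(self.b):
            self.m[i][j] = 1
        for row in self.m:
            row[i] = 0

    def lru(self):
        # The least recently used block is the one whose row is all 0s
        # (blocks never referenced also qualify).
        for i, row in enumerate(self.m):
            if not any(row):
                return i
```

Running the reference sequence 2, 1, 3, 0 from the text leaves block 2 with an all-zero row, identifying it as least recently used.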
4) Approximate methods.
When the number of blocks to consider increases above about four to eight,
approximate methods are necessary for the LRU algorithm. Fig. 20 shows a two-
stage approximation method with eight blocks, which is applicable to any
replacement algorithm. The eight blocks in Fig. 20 are divided into four pairs, and
each pair has one status bit to indicate the most/least recently used block in the
pair (simply set or reset by reference to each block). The least recently used
replacement algorithm now only considers the four pairs. Six status bits are
necessary (using the reference matrix) to identify the least recently used pair
which, together with the status bit of the pair, identifies the least recently used
block of a pair.
Figure 20: Two-stage replacement algorithm
The method can be extended to further levels. For example, sixteen blocks can be
divided into four groups, each group having two pairs. One status bit can be
associated with each pair, identifying the block in the pair, and another with each
group, identifying the group in a pair of groups. A true least recently used
algorithm is applied to the groups. In fact, the scheme could be taken to its logical
conclusion of extending to a full binary tree.
Fig. 21 gives an example. Here, there are four blocks in a set. One status bit, B0,
specifies which half of the blocks is most/least recently used. Two more bits, B1
and B2, specify which block of each pair is most/least recently used. Every time a
cache block is referenced (or loaded on a miss), the status bits are updated. For
example, if block L2 is referenced, B2 is set to a 0 to indicate that L2 is the most
recently used of the pair L2 and L3. B0 is set to a 1 to indicate that L2/L3 is the most
recently used of the four blocks, L0, L1, L2 and L3. To identify the line to replace
on a miss, the status bits are examined. If B0 = 0, then the block is either L0 or L1.
If then B1 = 0, it is L0.
Figure 21: Replacement algorithm using a tree selection
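A sketch of the tree selection for a four-block set follows. The bit polarity used here is one common convention (each bit points toward the less recently used half), which may differ from the figure's labelling:

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one four-block set (L0..L3)."""
    def __init__(self):
        # b0 chooses between halves {L0,L1} and {L2,L3};
        # b1 and b2 choose within each pair.
        self.b0 = self.b1 = self.b2 = 0

    def access(self, k):
        # Point every bit on the path away from the block just used.
        if k < 2:
            self.b0 = 1              # right half {L2,L3} is now the LRU side
            self.b1 = 1 if k == 0 else 0
        else:
            self.b0 = 0              # left half {L0,L1} is now the LRU side
            self.b2 = 1 if k == 2 else 0

    def victim(self):
        # Follow the bits to the (approximately) least recently used block.
        if self.b0 == 0:
            return 0 if self.b1 == 0 else 1
        return 2 if self.b2 == 0 else 3
```

After accessing L0, L1, L2, L3 in order, the victim is L0 (the true LRU block); the scheme is only an approximation, so after a further reference it may pick a block that is not the exact LRU.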
Self Assessment Questions
1. Discuss the various types of memory replacement algorithms in brief.
2. Write a note on:
a. Register Stack method b. Reference matrix method
Second-level caches
When the cache is integrated into the processor, it will be impossible to increase its size
should the performance not be sufficient. In any case, increasing the size of the cache
may create a slower cache. As an alternative, which has become very popular, a second
larger cache can be introduced between the first cache and the main memory as shown in
Fig. 22. This “second-level” cache is sometimes called a secondary cache.
Figure 22: Two-level caches
On a memory reference, the processor will access the first-level cache. If the information
is not found there (a first-level cache miss occurs), the second-level cache will be
accessed. If it is not in the second cache (a second-level cache miss occurs), then the
main memory must be accessed. Memory locations will be transferred to the second-level
cache and then to the first-level cache, so that two copies of a memory location will exist
in the cache system at least initially, i.e., locations cached in the second-level cache also
exist in the first-level cache. This is known as the Principle of Inclusion. (Of course, the
copies held in the second-level cache will not normally be accessed while the locations
remain in the first-level cache.) Whether this continues will depend upon the replacement and
write policies. The replacement policy practiced in both caches would normally be the
least recently used algorithm. Normally write-through will be practiced between the
caches, which will maintain duplicate copies. The block size of the second-level cache
will be at least the same size as, if not larger than, that of the first-level cache, because
otherwise on a first-level cache miss, more than one second-level cache line would need
to be transferred into the first-level cache block.
Optimizing the data cache performance
When we deal with multiple arrays with some arrays accessed by rows and some by
columns, storing the arrays row-by-row or column-by-column does not solve the
problem because both rows and columns are used in each iteration of the loop. We must
bring the same data into the cache again and again if the cache is not large enough to hold
all the data, which is a waste. We will use a matrix multiplication (C = A.B, where A, B,
and C are respectively m x p, p x n, and m x n matrices) as an example to show how to
utilize the locality to improve cache performance.
Principle of Locality
Since code is generally executed sequentially, virtually all programs repeat sections of
code and repeatedly access the same or nearby data. This characteristic is embodied in
the Principle of Locality, which has been found empirically to be obeyed by most
programs. It applies to both instruction references and data references, though it is more
likely in instruction references. It has two main aspects:
1. Temporal locality (locality in time) – individual locations, once referenced, are
likely to be referenced again in the near future.
2. Spatial locality (locality in space) – references, including the next location, are
likely to be near the last reference.
Temporal locality is found in instruction loops, data stacks and variable accesses. Spatial
locality describes the characteristic that programs access a number of distinct regions.
Sequential locality describes sequential locations being referenced and is a main attribute
of program construction. It can also be seen in data accesses, as data items are often
stored in sequential locations.
Taking advantage of temporal locality
When instructions are formed into loops which are executed many times, the length of a
loop is usually quite small. Therefore once a cache is loaded with loops of instructions
from the main memory, the instructions are used more than once before new instructions
are required from the main memory. The same situation applies to data; data is repeatedly
accessed. Suppose the reference is repeated n times in all during a program loop and after
the first reference, the location is always found in the cache, then the average access time
would be:
ta = (n*tcache + tmain)/n = tcache + tmain/n
where n = number of references. As n increases, the average access time decreases. The
increase in speed will, of course, depend upon the program. Some programs might have a
large amount of temporal locality, while others have less. We can do some optimization
about this.
Taking advantage of spatial locality
To take advantage of spatial locality, we will transfer not just one byte or word from the
main memory to the cache (and vice versa) but a series of sequential locations called a
block. We have assumed that it is necessary to reference the cache before a reference is
made to the main memory to fetch a word, and it is usual to look into the cache first to
see if the information is held there.
Data Blocking
For the matrix multiplication C = A.B, suppose we write the code as below:

for (I = 0; I < m; I++)
  for (J = 0; J < n; J++) {
    R = 0;
    for (K = 0; K < p; K++)
      R = R + A[I][K] * B[K][J];
    C[I][J] = R;
  }
The two inner loops read all p by n elements of B and access the same p elements in a
row of A repeatedly, and write one row of n elements of C. The number of capacity
misses clearly depends on the dimension parameters: m, n, p and the size of the cache. If
the cache can hold all three matrices, then all is well, provided there are no cache
conflicts. In the worst case, there would be (2*m*n*p + m*n) words read from memory
for m*n*p operations.
To enhance the cache performance if it is not big enough, we use an optimization
technique: blocking. The block method for this matrix product consists of:
• Splitting the result matrix C into blocks CI,J of size Nb x Nb; each block is computed
in a contiguous array Cb which is then copied back into the right CI,J.
• Splitting matrices A and B into panels AI and BJ of size (Nb x p) and (p x Nb); each
panel is copied into contiguous arrays Ab and Bb. The choice of Nb must ensure
that Cb, Ab and Bb fit into one level of cache, usually the L2 cache.
Then we rewrite the code as:
for (I = 0; I < m/Nb; I++) {
  Ab = AI;
  for (J = 0; J < n/Nb; J++) {
    Bb = BJ;
    Cb = 0;
    for (K = 0; K < p/Nb; K++)
      Cb = Cb + AbK * BKb;
    CI,J = Cb;
  }
}
Here “=” means assignment for matrices.
We suppose for simplicity that Nb divides m, n and p. Figure 23 below may help you
in understanding the operations performed on blocks. With this algorithm, matrix A
is loaded into the cache only once, compared with the n accesses of the original
version, while matrix B is still accessed m times. This simple block method greatly
reduces memory accesses, and real codes may choose, by looking at the matrix sizes,
which loop structure (ijk vs. jik) is most appropriate and whether some matrix
operand fits entirely into the cache.
Figure 23
In the above we have not discussed the use of the L1 cache. In fact, L1 will generally
be too small to hold a CI,J block and one panel each of A and B, but remember that the
operation Cb = Cb + AbK*BKb is itself a matrix-matrix product, so each operand AbK
and BKb is accessed Nb times: this part could also use a block method. Since Nb is
relatively small, the implementation may load only one of Cb, AbK and BKb into the
L1 cache and work with the others from L2.
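The blocked loop structure described above can be sketched as runnable code (plain Python lists; the explicit copies into contiguous arrays Ab, Bb and Cb are elided, since they only matter for real cache behaviour, but the tiled loop nest is the same):

```python
def matmul_blocked(A, B, m, n, p, Nb):
    """Compute C = A.B (A is m x p, B is p x n) with Nb x Nb tiles.
    Assumes Nb divides m, n and p, as in the text."""
    C = [[0.0] * n for _ in range(m)]
    for I in range(0, m, Nb):
        for J in range(0, n, Nb):
            for K in range(0, p, Nb):
                # Accumulate the tile product of A[I:I+Nb, K:K+Nb]
                # and B[K:K+Nb, J:J+Nb] into the C tile.
                for i in range(I, I + Nb):
                    for k in range(K, K + Nb):
                        a = A[i][k]
                        for j in range(J, J + Nb):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops walk over tiles, so at any moment only one tile of each matrix is "hot"; in a compiled implementation Nb would be chosen so that the three tiles fit in the target cache level.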
Summary
This unit dealt with memory management, one of the responsibilities of the operating
system. It covered the memory hierarchy, paging and page handling, segmentation with
its policies and algorithms, cache memory, cache memory organization and associative
mapping, and cache performance. Managing the sharing of primary and secondary
memory and minimizing memory access time are the vital goals of memory
management. The unit also covered the memory fetch and write mechanisms,
replacement policies, etc.
Terminal Questions
1. Memory management is important in operating systems. Discuss the main
problems that can occur if memory is managed poorly.
2. Explain the difference between logical and physical addresses.
3. Consider a paging system with a page-table stored in memory. If a memory
reference takes 200 nanoseconds, how long does a paged memory reference take? If
we add associative registers, and 75 percent of all page table references are found in
the associative registers, what is the effective memory reference time? (Assume that
looking for (and maybe finding) a page-table entry in the associative memory takes
zero time).
4. Consider a demand-paging system with a paging disk that has an average access
and transfer time of 20 milliseconds. Addresses are translated through a page table in
main memory, with an access time of 1 microsecond per memory access. Thus, each
memory reference through the page table takes two accesses. To improve this time we
have added an associative memory that reduces access time to one memory reference
if the page table entry is in the associative memory. Assume that 80 percent of the
accesses are in the associative memory and that of the remaining, 10 percent (or 2
percent of the total) cause page faults. What is the effective memory access time?
5. We have discussed LRU as an attempt to predict future memory access patterns
based on previous access patterns (i.e. if we haven’t accessed a particular page in a
while, we are not likely to reference it again soon). Another idea that some
researchers have explored is to record the memory reference pattern from the last
time the program was run and use it to predict what it will access next time. Discuss
the positive and negative aspects of this idea.
Unit 5: CPU Scheduling
This unit covers Brief introduction of CPU scheduling, scheduling criteria and various
types of scheduling algorithms. Multiple-Processing scheduling and thread scheduling.
Introduction
Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. (Even a simple fetch from memory takes a long time relative to
CPU speeds.). In a simple system running a single process, the time spent waiting for I/O
is wasted, and those CPU cycles are lost forever. A scheduling system allows one process
to use the CPU while another is waiting for I/O, thereby making full use of otherwise lost
CPU cycles. The challenge is to make the overall system as “efficient” and “fair” as
possible, subject to varying and often dynamic conditions, and where “efficient” and
“fair” are somewhat subjective terms, often subject to shifting priority policies.
Objective:
At the end of this unit, you will be able to understand the:
• CPU-I/O Burst Cycle
• CPU Scheduler
• Scheduling Algorithms
• Multiple-Processor Scheduling
• Symmetric Multithreading
• Thread Scheduling
• Algorithm Evaluation
CPU-I/O Burst Cycle
Almost all processes alternate between two states in a continuing cycle, as shown in Figure
5.1 below:
• A CPU burst of performing calculations, and
• An I/O burst, waiting for data transfer in or out of the system.
Fig. 5.1: Alternating sequence of CPU and I/O Bursts
CPU bursts vary from process to process and from program to program, but extensive
studies show frequency patterns similar to those shown in Figure 5.2:
Fig. 5.2: Histogram of CPU-burst durations
Self Assessment Questions
1. Discuss the process alternate between two states in a continuing cycle.
2. Explain preemptive scheduling and non preemptive scheduling.
3. What is dispatcher?
CPU Scheduler
Whenever the CPU becomes idle, it is the job of the CPU Scheduler (a.k.a. the short-term
scheduler) to select another process from the ready queue to run next. The storage
structure for the ready queue and the algorithm used to select the next process are not
necessarily a FIFO queue. There are several alternatives to choose from, as well as
numerous adjustable parameters for each algorithm, which is the basic subject of this
entire unit.
Preemptive Scheduling
CPU scheduling decisions take place under one of four conditions:
1. When a process switches from the running state to the waiting state, such as for an I/O
request or invocation of the wait( ) system call.
2. When a process switches from the running state to the ready state, for example in
response to an interrupt.
3. When a process switches from the waiting state to the ready state, say at completion of
I/O or a return from wait( ).
4. When a process terminates.
For conditions 1 and 4 there is no choice – A new process must be selected. For
conditions 2 and 3 there is a choice – To either continue running the current process, or
select a different one. If scheduling takes place only under conditions 1 and 4, the system
is said to be non-preemptive, or cooperative. Under these conditions, once a process
starts running it keeps running, until it either voluntarily blocks or until it finishes.
Otherwise the system is said to be preemptive. Windows used non-preemptive scheduling
up to Windows 3.x, and started using pre-emptive scheduling with Win95. Macs used
non-preemptive prior to OSX, and pre-emptive since then. Note that pre-emptive
scheduling is only possible on hardware that supports a timer interrupt. It is to be noted
that pre-emptive scheduling can cause problems when two processes share data, because
one process may get interrupted in the middle of updating shared data structures.
Preemption can also be a problem if the kernel is busy implementing a system call (e.g.
updating critical kernel data structures) when the preemption occurs. Most modern
UNIXes deal with this problem by making the process wait until the system call has
either completed or blocked before allowing the preemption. Unfortunately, this solution is
problematic for real-time systems, as real-time response can no longer be guaranteed.
Some critical sections of code protect themselves from concurrency problems by
disabling interrupts before entering the critical section and re-enabling interrupts on
exiting the section. Needless to say, this should only be done in rare situations, and only
on very short pieces of code that will finish quickly (usually just a few machine
instructions).
Dispatcher
The dispatcher is the module that gives control of the CPU to the process selected by the
scheduler. This function involves:
• Switching context.
• Switching to user mode.
• Jumping to the proper location in the newly loaded program.
The dispatcher needs to be as fast as possible, as it is run on every context switch. The
time consumed by the dispatcher is known as dispatch latency.
Scheduling Criteria
There are several different criteria to consider when trying to select the “best” scheduling
algorithm for a particular situation and environment, including:
• CPU utilization – Ideally the CPU would be busy 100% of the time, so as to waste no
CPU cycles. On a real system CPU usage should range from 40% (lightly loaded) to
90% (heavily loaded).
• Throughput – The number of processes completed per unit time. May range from
10 per second to 1 per hour depending on the specific processes.
• Turnaround time – The time required for a particular process to complete, from
submission time to completion (wall-clock time).
• Waiting time – How much time processes spend in the ready queue waiting their
turn to get on the CPU.
• Load average – The average number of processes sitting in the ready queue waiting
their turn to get onto the CPU. Reported as 1-minute, 5-minute, and 15-minute
averages by “uptime” and “who”.
• Response time – The time taken in an interactive program from the issuance of a
command to the commencement of a response to that command.
In general one wants to optimize the average value of a criterion (maximize CPU
utilization and throughput, and minimize all the others). However, sometimes one wants
to do something different, such as to minimize the maximum response time. Sometimes it
is more desirable to minimize the variance of a criterion than its actual value, i.e. users
are more accepting of a consistent, predictable system than an inconsistent one, even if it
is a little bit slower.
Scheduling Algorithms
The following subsections will explain several common scheduling strategies, looking at
only a single CPU burst each for a small number of processes. Obviously real systems
have to deal with a lot more simultaneous processes executing their CPU-I/O burst
cycles.
First-Come First-Serve Scheduling, FCFS
FCFS is very simple – Just a FIFO queue, like customers waiting in line at the bank or
the post office or at a copying machine. Unfortunately, however, FCFS can yield some
very long average wait times, particularly if the first process to get there takes a long
time. For example, consider the following three processes:
Process Burst Time
P1 24
P2 3
P3 3
In the first Gantt chart below, process P1 arrives first. The average waiting time for the
three processes is (0 + 24 + 27) / 3 = 17.0 ms. In the second Gantt chart below, the same
three processes have an average wait time of
(0 + 3 + 6) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the
second case two of the three finish much quicker, and the other process is only delayed
by a short amount.
FCFS can also block the system in a busy dynamic system in another way, known as the
convoy effect. When one CPU intensive process blocks the CPU, a number of I/O
intensive processes can get backed up behind it, leaving the I/O devices idle. When the
CPU hog finally relinquishes the CPU, then the I/O processes pass through the CPU
quickly, leaving the CPU idle while everyone queues up for I/O, and then the cycle
repeats itself when the CPU intensive process gets back to the ready queue.
Shortest-Job-First Scheduling, SJF
The idea behind the SJF algorithm is to pick the quickest little job that needs to be
done, get it out of the way first, and then pick the next smallest job to do next.
(Technically this algorithm picks a process based on the next shortest CPU burst, not the
overall process time.) For example, the Gantt chart below is based upon the following
CPU burst times (and the assumption that all jobs arrive at the same time):
Process Burst Time
P1 6
P2 8
P3 7
P4 3
In the case above the average wait time is (0 + 3 + 9 + 16) / 4 = 7.0 ms, (as opposed to
10.25 ms for FCFS for the same processes.)
SJF can be proven to be optimal with respect to average waiting time, but it suffers from
one important problem: How do you know how long the next CPU burst is going to be?
• For long-term batch jobs this can be done based upon the limits that users set for
their jobs when they submit them, which encourages them to set low limits, but risks
their having to re-submit the job if they set the limit too low. However, that does not
work for short-term CPU scheduling on an interactive system.
• Another option would be to statistically measure the run-time characteristics of jobs,
particularly if the same tasks are run repeatedly and predictably. But once again that
really isn’t a viable option for short-term CPU scheduling in the real world.
• A more practical approach is to predict the length of the next burst, based on some
historical measurement of recent burst times for this process. One simple, fast, and
relatively accurate method is the exponential average, which can be defined as
follows:
estimate[i + 1] = alpha * burst[i] + (1.0 – alpha) * estimate[i]

In this scheme the previous estimate contains the history of all previous times, and alpha
serves as a weighting factor for the relative importance of recent data versus past history.
If alpha is 1.0, then past history is ignored, and we assume the next burst will be the same
length as the last burst. If alpha is 0.0, then all measured burst times are ignored, and we
just assume a constant burst time. Most commonly alpha is set at 0.5, as illustrated in
Figure 5.3:
Fig. 5.3: Prediction of the length of the next CPU burst
SJF can be either preemptive or non-preemptive. Preemption occurs when a new process
arrives in the ready queue that has a predicted burst time shorter than the time remaining
in the process whose burst is currently on the CPU. Preemptive SJF is sometimes referred
to as shortest remaining time first scheduling. For example, the following Gantt chart is
based upon the following data:
Process Arrival Time Burst Time
P1 0 8
P2 1 4
P3 2 9
P4 3 5
The average wait time in this case is ((10 – 1) + 0 + (17 – 2) + (5 – 3)) / 4 = 26 / 4 = 6.5 ms
for P1 through P4 respectively (as opposed to 7.75 ms for non-preemptive SJF or 8.75 ms
for FCFS).
Priority Scheduling
Priority scheduling is a more general case of SJF, in which each job is assigned a priority
and the job with the highest priority gets scheduled first. (SJF uses the inverse of the next
expected burst time as its priority – The smaller the expected burst, the higher the
priority.)
Note that in practice, priorities are implemented using integers within a fixed range, but
there is no agreed-upon convention as to whether “high” priorities use large numbers or
small numbers. This book uses low number for high priorities, with 0 being the highest
possible priority. For example, the following Gantt chart is based upon these process
burst times and priorities, and yields an average waiting time of 8.2 ms:
Process Burst Time Priority
P1 10 3
P2 1 1
P3 2 4
P4 1 5
P5 5 2
Priorities can be assigned either internally or externally. Internal priorities are assigned
by the OS using criteria such as average burst time, ratio of CPU to I/O activity, system
resource use, and other factors available to the kernel. External priorities are assigned by
users, based on the importance of the job, fees paid, politics, etc. Priority scheduling can
be either preemptive or non-preemptive. Priority scheduling can suffer from a major
problem known as indefinite blocking, or starvation, in which a low-priority task can
wait forever because there are always some other jobs around that have higher priority.
• If this problem is allowed to occur, then processes will either run eventually when
the system load lightens (at say 2:00 a.m.), or will eventually get lost when the
system is shut down or crashes. (There are rumors of jobs that have been stuck for
years.)
• One common solution to this problem is aging, in which the priorities of jobs
increase the longer they wait. Under this scheme a low-priority job will eventually
get its priority raised high enough that it gets run.
Round Robin Scheduling
Round robin scheduling is similar to FCFS scheduling, except that CPU bursts are
assigned with limits called time quanta. When a process is given the CPU, a timer is
set for whatever value has been set for the time quantum.
• If the process finishes its burst before the time-quantum timer expires, then it
releases the CPU voluntarily, just as in the normal FCFS algorithm.
• If the timer goes off first, then the process is swapped out of the CPU and moved to
the back of the ready queue.
The ready queue is maintained as a circular queue, so when all processes have had a turn,
the scheduler gives the first process another turn, and so on. RR scheduling can give the
effect of all processes sharing the CPU equally, although the average wait time can be
longer than with other scheduling algorithms. In the following example the average wait
time is 5.66 ms.
Process Burst Time
P1 24
P2 3
P3 3
The performance of RR is sensitive to the time quantum selected. If the quantum is large
enough, then RR reduces to the FCFS algorithm; if it is very small, then each of the n
processes gets 1/nth of the processor time, effectively sharing the CPU equally.
BUT, a real system invokes overhead for every context switch, and the smaller the time
quantum the more context switches there are. (See Figure 5.4 below.) Most modern
systems use time quantum between 10 and 100 milliseconds, and context switch times on
the order of 10 microseconds, so the overhead is small relative to the time quantum.
Fig. 5.4: The way in which a smaller time quantum increases context switches
Turnaround time also varies with the time quantum, in a non-obvious manner. Consider, for
example the processes shown in Figure 5.5:
Fig. 5.5: The way in which turnaround time varies with the time quantum
In general, turnaround time is minimized if most processes finish their next CPU burst
within one time quantum. For example, with three processes of
10 ms bursts each, the average turnaround time for 1 ms quantum is 29, and for 10 ms
quantum it reduces to 20. However, if it is made too large, then RR just degenerates to
FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time
quantum.
Multilevel Queue Scheduling
When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most appropriate for
that type of job, and/or with different parametric adjustments. Scheduling must also be
done between queues, that is scheduling one queue to get time relative to other queues.
Two common options are strict priority (no job in a lower priority queue runs until all
higher priority queues are empty) and round-robin (each queue gets a time slice in turn,
possibly of different sizes.)
Note that under this algorithm jobs cannot switch from queue to queue – Once they are
assigned a queue, that is their queue until they finish.
Fig. 5.6: Multilevel queue scheduling
Multilevel Feedback-Queue Scheduling
Multilevel feedback queue scheduling is similar to the ordinary multilevel queue
scheduling described above, except jobs may be moved from one queue to another for a
variety of reasons:
� If the characteristics of a job change between CPU-intensive and I/O intensive, then it
may be appropriate to switch a job from one queue to another.
� Aging can also be incorporated, so that a job that has waited for a long time can get
bumped up into a higher priority queue for a while.
Multilevel feedback queue scheduling is the most flexible, because it can be tuned for
any situation. But it is also the most complex to implement because of all the adjustable
parameters. Some of the parameters which define one of these systems include:
• The number of queues.
• The scheduling algorithm for each queue.
• The methods used to upgrade or demote processes from one queue to another
(which may be different).
• The method used to determine which queue a process enters initially.
Fig. 5.7: Multilevel feedback queues
Self Assessment Questions
1. Explain the several common scheduling strategies in brief.
2. Explain the FCFS scheduling with a suitable example.
3. Write short notes on:
a. Priority Scheduling b. RR Scheduling
Multiple-Processor Scheduling
When multiple processors are available, then the scheduling gets more complicated,
because now there is more than one CPU which must be kept busy and in effective use at
all times. Load sharing revolves around balancing the load between multiple processors.
Multi-processor systems may be heterogeneous, (different kinds of CPUs), or
homogenous, (all the same kind of CPU). Even in the latter case there may be special
scheduling constraints, such as devices which are connected via a private bus to only one
of the CPUs. This book will restrict its discussion to homogenous systems.
Approaches to Multiple-Processor Scheduling
One approach to multi-processor scheduling is asymmetric multiprocessing, in which
one processor is the master, controlling all activities and running all kernel code, while
the others run only user code. This approach is relatively simple, as there is no need to
share critical system data. Another approach is symmetric multiprocessing, SMP, where
each processor schedules its own jobs, either from a common ready queue or from
separate ready queues for each processor. Virtually all modern OSes support SMP,
including XP, Win 2000, Solaris, Linux, and Mac OSX.
Processor Affinity
Processors contain cache memory, which speeds up repeated accesses to the same
memory locations. If a process were to switch from one processor to another each time it
got a time slice, the data in the cache (for that process) would have to be invalidated and
re-loaded from main memory, thereby obviating the benefit of the cache. Therefore SMP
systems attempt to keep processes on the same processor, via processor affinity.
Soft affinity occurs when the system attempts to keep processes on the same processor
but makes no guarantees. Linux and some other OSes support hard affinity, in which a
process specifies that it is not to be moved between processors.
Load Balancing
Obviously an important goal in a multiprocessor system is to balance the load between
processors, so that one processor won’t be sitting idle while another is overloaded.
Systems using a common ready queue are naturally self-balancing, and do not need any
special handling. Most systems, however, maintain separate ready queues for each
processor.
Balancing can be achieved through either push migration or pull migration:
• Push migration involves a separate process that runs periodically,
(e.g. every 200 milliseconds), and moves processes from heavily loaded
processors onto less loaded ones.
• Pull migration involves idle processors taking processes from the ready queues of
other processors.
• Push and pull migration are not mutually exclusive.
Note that moving processes from processor to processor to achieve load balancing works
against the principle of processor affinity, and if not carefully managed, the savings
gained by balancing the system can be lost in rebuilding caches. One option is to only
allow migration when imbalance surpasses a given threshold.
Symmetric Multithreading
An alternative strategy to SMP is SMT, Symmetric Multi-Threading, in which multiple
virtual (logical) CPUs are used instead of (or in combination with) multiple physical
CPUs. SMT must be supported in hardware, as each logical CPU has its own registers
and handles its own interrupts. (Intel refers to SMT as hyperthreading technology.) To
some extent the OS does not need to know if the processors it is managing are real or
virtual. On the other hand, some scheduling decisions can be optimized if the scheduler
knows the mapping of virtual processors to real CPUs. (Consider the scheduling of two
CPU-intensive processes on the architecture shown below.)
Fig. 5.8: A typical SMT architecture
Thread Scheduling
The process scheduler schedules only the kernel threads. User threads are mapped to
kernel threads by the thread library – The OS (and in particular the scheduler) is unaware
of them.
Contention Scope
Contention scope refers to the scope in which threads compete for the use of physical
CPUs. On systems implementing many-to-one and many-to-many threads, Process
Contention Scope, PCS, occurs, because competition occurs between threads that are
part of the same process.
(This is the management / scheduling of multiple user threads on a single kernel thread,
and is managed by the thread library.)
System Contention Scope, SCS, involves the system scheduler scheduling kernel threads
to run on one or more CPUs. Systems implementing one-to-one threads (XP, Solaris 9,
Linux), use only SCS. PCS scheduling is typically done with priority, where the
programmer can set and/or change the priority of threads created by his or her programs.
Even time slicing is not guaranteed among threads of equal priority.
Pthread Scheduling
The Pthread library provides for specifying scope contention:
• PTHREAD_SCOPE_PROCESS schedules threads using PCS, by scheduling user
threads onto available LWPs using the many-to-many model.
• PTHREAD_SCOPE_SYSTEM schedules threads using SCS, by binding user
threads to particular LWPs, effectively implementing a one-to-one model.
The pthread_attr_getscope() and pthread_attr_setscope() functions provide for
determining and setting the contention scope, respectively:
Fig. 5.9: Pthread Scheduling API
Operating System Examples
Example: Solaris Scheduling
• Priority-based kernel thread scheduling.
• Four classes (real-time, system, interactive, and time-sharing), and multiple
queues / algorithms within each class.
• Default is time-sharing.
o Process priorities and time slices are adjusted dynamically in a multilevel-
feedback priority queue system.
o Time slices are inversely proportional to priority – Higher priority jobs get
smaller time slices.
o Interactive jobs have higher priority than CPU-Bound ones.
o See the table below for some of the 60 priority levels and how they shift.
“Time quantum expired” and “return from sleep” indicate the new priority
when those events occur.
Fig. 5.10: Solaris scheduling
Fig. 5.11: Solaris dispatch table for interactive and time-sharing threads
Solaris 9 introduced two new scheduling classes: Fixed priority and fair share.
• Fixed priority is similar to time sharing, but not adjusted dynamically.
• Fair share uses shares of CPU time rather than priorities to schedule jobs. A
certain share of the available CPU time is allocated to a project, which is a set of
processes.
System class is reserved for kernel use. (User programs running in kernel mode are NOT
considered in the system scheduling class.)
Fig. 5.13: Windows XP priorities
Fig. 5.14: List of tasks indexed according to priority
Algorithm Evaluation
The first step in determining which algorithm (and what parameter settings within that
algorithm) is optimal for a particular operating environment is to determine what criteria
are to be used, what goals are to be targeted, and what constraints if any must be applied.
For example, one might want to “maximize CPU utilization, subject to a maximum
response time of
1 second”.
Once criteria have been established, then different algorithms can be analyzed and a “best
choice” determined. The following sections outline some different methods for
determining the “best choice”.
Deterministic Modeling
If a specific workload is known, then the exact values for major criteria can be fairly
easily calculated, and the “best” determined. For example, consider the following
workload (with all processes arriving at time 0), and the resulting schedules determined
by three different algorithms:
Process Burst Time
P1 10
P2 29
P3 3
P4 7
P5 12
The average waiting times for FCFS, SJF, and RR are 28ms, 13ms, and 23ms
respectively. Deterministic modeling is fast and easy, but it requires specific known
input, and the results only apply for that particular set of input. However by examining
multiple similar cases, certain trends can be observed. (Like the fact that for processes
arriving at the same time, SJF will always yield the shortest average wait time.)
Queuing Models
Specific process data is often not available, particularly for future times. However a study
of historical performance can often produce statistical descriptions of certain important
parameters, such as the rate at which new processes arrive, the ratio of CPU bursts to I/O
times, the distribution of CPU burst times and I/O burst times, etc.
Armed with those probability distributions and some mathematical formulas, it is
possible to calculate certain performance characteristics of individual waiting queues. For
example, Little’s Formula says that for an average queue length of N, with an average
waiting time in the queue of W, and an average arrival of new jobs in the queue of
Lambda, then these three terms can be related by:
N = Lambda * W
Queuing models treat the computer as a network of interconnected queues, each of which
is described by its probability distribution statistics and formulas such as Little’s formula.
Unfortunately real systems and modern scheduling algorithms are so complex as to make
the mathematics intractable in many cases with real systems.
Simulations
Another approach is to run computer simulations of the different proposed algorithms
(and adjustment parameters) under different load conditions, and to analyze the results to
determine the “best” choice of operation for a particular load pattern. Operating
conditions for simulations are often randomly generated using distribution functions
similar to those described above. A better alternative when possible is to generate trace
tapes, by monitoring and logging the performance of a real system under typical expected
work loads. These are better because they provide a more accurate picture of system
loads, and also because they allow multiple simulations to be run with the identical
process load, and not just statistically equivalent loads. A compromise is to randomly
determine system loads and then save the results into a file, so that all simulations can be
run against identical randomly determined system loads.
Although trace tapes provide more accurate input information, they can be difficult and
expensive to collect and store, and their use increases the complexity of the simulations
significantly. There is also some question as to whether the future performance of the
new system will really match the past performance of the old system. (If the system runs
faster, users may take fewer coffee breaks, and submit more processes per hour than
under the old system. Conversely if the turnaround time for jobs is longer, intelligent
users may think more carefully about the jobs they submit rather than randomly
submitting jobs and hoping that one of them works out.)
Fig. 5.15: Evaluation of CPU schedulers by simulation
Implementation
The only real way to determine how a proposed scheduling algorithm is going to operate
is to implement it on a real system. For experimental algorithms and those under
development, this can cause difficulties and resistances among users who don’t care
about developing OS’s and are only trying to get their daily work done. Even in this case,
the measured results may not be definitive, for at least two major reasons: (1) System
work loads are not static, but change over time as new programs are installed, new users
are added to the system, new hardware becomes available, new work projects get started,
and even societal changes. (For example the explosion of the Internet has drastically
changed the amount of network traffic that a system sees and the importance of handling
it with rapid response times.) (2) As mentioned above, changing the scheduling system
may have an impact on the work load and the ways in which users use the system.
Most modern systems provide some capability for the system administrator to adjust
scheduling parameters, either on the fly or as the result of a reboot or a kernel rebuild.
Summary
This unit covered the alternating sequence of CPU and I/O bursts, and the CPU
scheduler, for which there are several alternatives to choose from as well as numerous
adjustable parameters for each scheduling algorithm. We discussed various common
scheduling strategies, such as FCFS scheduling, shortest-job-first scheduling, priority
scheduling, RR scheduling, multilevel queue scheduling and multiple-processor
scheduling. Finally, we also discussed load balancing, thread scheduling and various
algorithm-evaluation models.
Terminal Questions
1. What do you understand by the scheduling process? What are the conditions
which guide CPU scheduling decisions?
2. What is the significance of the dispatcher module in the scheduling process?
Explain dispatch latency.
3. What are the various scheduling algorithms? Discuss the advantages of one over
the other.
4. When is it advisable to follow the priority scheduling approach? What is the
suggested solution to deal with the starvation problem in this approach?
5. What is load balancing? How is load balancing achieved in multiprocessor
systems?
Unit 6 : Deadlocks:
This unit covers the deadlock principles, deadlock detection and recovery, deadlock
avoidance , prevention, pipes.
Introduction
Recall that one definition of an operating system is a resource allocator. There are many
resources that can be allocated to only one process at a time, and we have seen several
operating system features that allow this, such as mutexes, semaphores or file locks.
Sometimes a process has to reserve more than one resource. For example, a process
which copies files from one tape to another generally requires two tape drives. A process
which deals with databases may need to lock multiple records in a database.
A deadlock is a situation in which two computer programs sharing the same resource are
effectively preventing each other from accessing the resource, resulting in both programs
ceasing to function.
The earliest computer operating systems ran only one program at a time. All of the
resources of the system were available to this one program. Later, operating systems ran
multiple programs at once, interleaving them. Programs were required to specify in
advance what resources they needed so that they could avoid conflicts with other
programs running at the same time. Eventually some operating systems offered dynamic
allocation of resources. Programs could request further allocations of resources after they
had begun running. This led to the problem of the deadlock. Here is the simplest
example:
Program 1 requests resource A and receives it.
Program 2 requests resource B and receives it.
Program 1 requests resource B and is queued up, pending the release of B.
Program 2 requests resource A and is queued up, pending the release of A.
Now neither program can proceed until the other program releases a resource. The
operating system cannot know what action to take. At this point the only alternative is to
abort (stop) one of the programs.
Learning to deal with deadlocks had a major impact on the development of operating
systems and the structure of databases. Data was structured and the order of requests was
constrained in order to avoid creating deadlocks.
In general, resources allocated to a process are not preemptable; this means that once a
resource has been allocated to a process, there is no simple mechanism by which the
system can take the resource back from the process unless the process voluntarily gives it
up or the system administrator kills the process. This can lead to a situation called
deadlock. A set of processes or threads is deadlocked when each process or thread is
waiting for a resource to be freed which is controlled by another process. Here is an
example of a situation where deadlock can occur.
Mutex M1, M2;

/* Thread 1 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M1);
    Mutex_lock(&M2);
    CriticalSection();
    Mutex_unlock(&M2);
    Mutex_unlock(&M1);
}

/* Thread 2 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M2);
    Mutex_lock(&M1);
    CriticalSection();
    Mutex_unlock(&M1);
    Mutex_unlock(&M2);
}
Suppose thread 1 is running and locks M1, but before it can lock M2, it is interrupted.
Thread 2 starts running; it locks M2, but when it tries to obtain and lock M1, it is
blocked because M1 is already locked (by thread 1). Eventually thread 1 starts running
again and tries to obtain and lock M2, but it is blocked because M2 is already locked by
thread 2. Both threads are blocked; each is waiting for an event which will never occur.
Traffic gridlock is an everyday example of a deadlock situation.
In order for deadlock to occur, four conditions must be true.
• Mutual exclusion – Each resource is either currently allocated to exactly one
process or it is available. (Two processes cannot simultaneously control the same
resource or be in their critical section).
• Hold and Wait – Processes currently holding resources can request new resources.
• No preemption – Once a process holds a resource, it cannot be taken away by
another process or the kernel.
• Circular wait – Each process is waiting to obtain a resource which is held by
another process.
The dining philosophers problem discussed in an earlier section is a classic example of
deadlock. Each philosopher picks up his or her left fork and waits for the right fork to
become available, but it never does.
Deadlock can be modeled with a directed graph. In a deadlock graph, vertices represent
either processes (circles) or resources (squares). A process which has acquired a resource
is shown with an arrow (edge) from the resource to the process. A process which has
requested a resource which has not yet been assigned to it is modeled with an arrow from
the process to the resource. If these edges form a cycle, there is deadlock.
The deadlock situation in the above code can be modeled as such a cycle: thread 1 holds
M1 and requests M2, while thread 2 holds M2 and requests M1.
This graph shows an extremely simple deadlock situation, but it is also possible for a
more complex situation to create deadlock. Here is an example of deadlock with four
processes and four resources.
There are a number of ways that deadlock can occur in an operating system. We have
seen some examples; here are two more.
• Two processes need to lock two files, the first process locks one file the second
process locks the other, and each waits for the other to free up the locked file.
• Two processes want to write a file to a print spool area at the same time and both
start writing. However, the print spool area is of fixed size, and it fills up before
either process finishes writing its file, so both wait for more space to become
available.
Objective :
At the end of this unit, you will be able to understand the :
• Solutions to deadlock
• Deadlock detection and recovery
• Deadlock avoidance
• Deadlock Prevention
• Pipes
Solutions to deadlock
There are several ways to address the problem of deadlock in an operating system.
• Just ignore it and hope it doesn’t happen
• Detection and recovery – if it happens, take action
• Dynamic avoidance by careful resource allocation. Check to see if a resource can
be granted, and if granting it will cause deadlock, don’t grant it.
• Prevention – change the rules
Ignore deadlock
The text refers to this as the Ostrich Algorithm. Just hope that deadlock doesn’t happen.
In general, this is a reasonable strategy. Deadlock is unlikely to occur very often; a
system can run for years without deadlock occurring. If the operating system has a
deadlock prevention or detection system in place, this will have a negative impact on
performance (slow the system down) because whenever a process or thread requests a
resource, the system will have to check whether granting this request could cause a
potential deadlock situation.
If deadlock does occur, it may be necessary to bring the system down, or at least
manually kill a number of processes, but even that is not an extreme solution in most
situations.
Deadlock detection and recovery
As we saw above, if there is only one instance of each resource, it is possible to detect
deadlock by constructing a resource allocation/request graph and checking for cycles.
Graph theorists have developed a number of algorithms to detect cycles in a graph. The
book discusses one of these. It uses only one data structure: a list of nodes, L.
A cycle detection algorithm
For each node N in the graph
1. Initialize L to the empty list and designate all edges as unmarked
2. Add the current node to L and check to see if it appears twice. If it does, there is a
cycle in the graph.
3. From the given node, check to see if there are any unmarked outgoing edges. If
yes, go to the next step, if no, skip the next step
4. Pick an unmarked edge, mark it, then follow it to the new current node and go to
step 3.
5. We have reached a dead end. Go back to the previous node and make that the
current node. If the current node is the starting Node and there are no unmarked
edges, there are no cycles in the graph. Otherwise, go to step 3.
Let’s work through an example with five processes and five resources. Here is the
resource request/allocation graph.
The algorithm needs to search each node; let’s start at node P1. We add P1 to L and
follow the only edge to R1, marking that edge. R1 is now the current node so we add that
to L, checking to confirm that it is not already in L. We then follow the unmarked edge to
P2, marking the edge, and making P2 the current node. We add P2 to L, checking to
make sure that it is not already in L, and follow the edge to R2. This makes R2 the
current node, so we add it to L, checking to make sure that it is not already there. We are
now at a dead end so we back up, making P2 the current node again. There are no more
unmarked edges from P2 so we back up yet again, making R1 the current node. There are
no more unmarked edges from R1 so we back up yet again, making P1 the current node.
Since there are no more unmarked edges from P1 and since this was our starting point,
we are through with this node (and all of the nodes visited so far).
We move to the next unvisited node P3, and initialize L to empty. We first follow the
unmarked edge to R1, putting R1 on L. Continuing, we make P2 the current node and
then R2. Since we are at a dead end, we repeatedly back up until P3 becomes the current
node again.
L now contains P3, R1, P2, and R2. P3 is the current node, and it has another unmarked
edge to R3. We make R3 the current node, add it to L, follow its edge to P4. We repeat
this process, visiting R4, then P5, then R5, then P3. When we visit P3 again we note that
it is already on L, so we have detected a cycle, meaning that there is a deadlock situation.
Once deadlock has been detected, it is not clear what the system should do to correct the
situation. There are three strategies.
• Preemption – we can take an already allocated resource away from a process and
give it to another process. This can present problems. Suppose the resource is a
printer and a print job is half completed. It is often difficult to restart such a job
without completely starting over.
• Rollback – In situations where deadlock is a real possibility, the system can
periodically make a record of the state of each process and when deadlock occurs,
roll everything back to the last checkpoint, and restart, but allocating resources
differently so that deadlock does not occur. This means that all work done after
the checkpoint is lost and will have to be redone.
• Kill one or more processes – this is the simplest and crudest, but it works.
Deadlock avoidance
The above solution allowed deadlock to happen, then detected that deadlock had occurred
and tried to fix the problem after the fact. Another solution is to avoid deadlock by only
granting resources if granting them cannot result in a deadlock situation later. However,
this works only if the system knows what requests for resources a process will be making
in the future, and this is an unrealistic assumption. The text describes the banker's
algorithm but then points out that it is essentially impossible to implement because of this
assumption.
Deadlock Prevention
The difference between deadlock avoidance and deadlock prevention is a little subtle.
Deadlock avoidance refers to a strategy where whenever a resource is requested, it is only
granted if it cannot result in deadlock. Deadlock prevention strategies involve changing
the rules so that processes will not make requests that could result in deadlock.
Here is a simple example of such a strategy. Suppose every possible resource is
numbered (easy enough in theory, but often hard in practice), and processes must make
their requests in order; that is, they cannot request a resource with a number lower than
any of the resources that they have been granted so far. Deadlock cannot occur in this
situation.
As an example, consider the dining philosophers problem. Suppose each chopstick is
numbered, and philosophers always have to pick up the lower numbered chopstick before
the higher numbered chopstick. Philosopher five picks up chopstick 4, philosopher 4
picks up chopstick 3, philosopher 3 picks up chopstick 2, philosopher 2 picks up
chopstick 1. Philosopher 1 is hungry, and without this assumption, would pick up
chopstick 5, thus causing deadlock. However, if the lower number rule is in effect, he/she
has to pick up chopstick 1 first, and it is already in use, so he/she is blocked. Philosopher
5 picks up chopstick 5, eats, and puts both down, allows philosopher 4 to eat. Eventually
everyone gets to eat.
An alternative strategy is to require all processes to request all of their resources at once,
and either all are granted or none are granted. Like the above strategy, this is
conceptually easy but often hard to implement in practice because it assumes that a
process knows what resources it will need in advance.
Livelock
There is a variant of deadlock called livelock. This is a situation in which two or more
processes continuously change their state in response to changes in the other process(es)
without doing any useful work. This is similar to deadlock in that no progress is made but
differs in that neither process is blocked or waiting for anything.
A human example of livelock would be two people who meet face-to-face in a corridor
and each moves aside to let the other pass, but they end up swaying from side to side
without making any progress because they always move the same way at the same time.
Addressing deadlock in real systems
Deadlock is a terrific theoretical problem for graduate students, but none of the solutions
discussed above can be implemented in a real world, general purpose operating system. It
would be difficult to require a user program to make requests for resources in a certain
way or in a certain order. As a result, most operating systems use the ostrich algorithm.
Some specialized systems have deadlock avoidance/prevention mechanisms. For
example, many database operations involve locking several records, and this can result in
deadlock, so database software often has a deadlock prevention algorithm.
The Unix file locking system lockf has a deadlock detection mechanism built into it.
Whenever a process attempts to lock a file or a record of a file, the operating system
checks to see if that process has locked other files or records, and if it has, it uses a graph
algorithm similar to the one discussed above to see if granting that request will cause
deadlock, and if it does, the request for the lock will fail, and the lockf system call will
return and errno will be set to EDEADLK.
Killing Zombies
Recall that if a child dies before its parent calls wait, the child becomes a zombie. In
some applications, a web server for example, the parent forks off lots of children but
doesn’t care whether the child is dead or alive. For example, a web server might fork a
new process to handle each connection, and each child dies when the client breaks the
connection. Such an application is at risk of producing many zombies, and zombies can
clog up the process table.
When a child dies, it sends a SIGCHLD signal to its parent. The parent process can
prevent zombies from being created by creating a signal handler routine for SIGCHLD
which calls wait whenever it receives a SIGCHLD signal. There is no danger that this
will cause the parent to block because it would only call wait when it knows that a child
has just died.
There are several versions of wait on a Unix system. The system call waitpid has this
prototype
#include <sys/types.h>
#include <sys/wait.h>
pid_t waitpid(pid_t pid, int *stat_loc, int options)
This will function like wait in that it waits for a child to terminate, but this function
allows the process to wait for a particular child by setting its first argument to the pid that
we want to wait for. However, that is not our interest here. If the first argument is set to
-1, it will wait for any child to terminate, just like wait. However, the third argument
can be set to WNOHANG. This will cause the function to return immediately if there are
no dead children. It is customary to use this function rather than wait in the signal
handler.
Here is some sample code
#include <sys/types.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

void zombiekiller(int n)
{
    int status;
    /* reap every dead child; SIGCHLD signals may coalesce, so loop */
    while (waitpid(-1, &status, WNOHANG) > 0)
        ;
    signal(SIGCHLD, zombiekiller);
}

int main()
{
    signal(SIGCHLD, zombiekiller);
    ....
}
Pipes
A second form of redirection is a pipe. A pipe is a connection between two processes in
which one process writes data to the pipe and the other reads from the pipe. Thus, it
allows one process to pass data to another process.
The Unix system call to create a pipe is
int pipe(int fd[2])
This function takes an array of two ints (file descriptors) as an argument. It creates a pipe
with fd[0] at one end and fd[1] at the other. Reading from the pipe and writing to the pipe
are done with the read and write calls that you have seen and used before. Although both
ends are opened for both reading and writing, by convention a process writes to fd[1] and
reads from fd[0]. Pipes only make sense if the process calls fork after creating the pipe.
Each process should close the end of the pipe that it is not using. Here is a simple
example in which a child sends a message to its parent through a pipe.
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    pid_t pid;
    int retval;
    int fd[2];
    int n;

    retval = pipe(fd);
    if (retval < 0) {
        printf("Pipe failed\n");   /* pipe is unlikely to fail */
        exit(0);
    }
    pid = fork();
    if (pid == 0) {                /* child */
        close(fd[0]);
        n = write(fd[1], "Hello from the child", 20);
        exit(0);
    }
    else if (pid > 0) {            /* parent */
        char buffer[64];
        close(fd[1]);
        n = read(fd[0], buffer, 64);
        buffer[n] = '\0';
        printf("I got your message: %s\n", buffer);
    }
    return 0;
}
There is no need for the parent to wait for the child to finish because reading from a pipe
will block until there is something in the pipe to read. If the parent runs first, it will try to
execute the read statement, and will immediately block because there is nothing in the
pipe. After the child writes a message to the pipe, the parent will wake up.
Pipes have a fixed size (often 4096 bytes) and if a process tries to write to a pipe which is
full, the write will block until a process reads some data from the pipe.
Here is a program which combines dup2 and pipe to redirect the output of the ls process
to the input of the more process as would be the case if the user typed
ls | more at the Unix command line.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void error(char *msg)
{
    perror(msg);
    exit(1);
}

int main()
{
    int p[2], retval;

    retval = pipe(p);
    if (retval < 0) error("pipe");
    retval = fork();
    if (retval < 0) error("forking");
    if (retval == 0) {      /* child */
        dup2(p[1], 1);      /* redirect stdout to pipe */
        close(p[0]);        /* don't permit this process to read from pipe */
        execl("/bin/ls", "ls", "-l", NULL);
        error("Exec of ls");
    }
    /* if we get here, we are the parent */
    dup2(p[0], 0);          /* redirect stdin to pipe */
    close(p[1]);            /* don't permit this process to write to pipe */
    execl("/bin/more", "more", NULL);
    error("Exec of more");
    return 0;
}
Summary
A deadlock is a situation that, whenever it occurs, halts the normal flow of execution of
an application, and so it needs to be understood well. To that end, this unit began with a
detailed discussion of the fundamental concepts of deadlock, followed by the various
situations that force a deadlock to occur. Finally, the unit covered methods of preventing
and avoiding deadlocks and, in case one does occur, mechanisms to detect it so that
corrective measures can be taken.
Terminal Questions
1. What do you mean by a deadlock? Explain.
2. Discuss the various conditions that must be true for deadlock to occur.
3. Discuss various approaches to overcome the problem of deadlock.
4. What do you mean by a zombie? Discuss in brief.
5. Explain the concept of pipes.
Unit 7 : Concurrency Control :
This unit deals with the concurrency, race condition, critical section, mutual exclusion
and Semaphores
Introduction
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrent use of shared resources is the source of
many difficulties, such as race conditions. Concurrency control is a method used to
ensure that processes are executed in a safe manner without affecting each other and
correct results are generated, while getting those results as quickly as possible. Mutual
exclusion is a way of making sure that if one process is using a shared modifiable data,
the other processes will be excluded from doing the same thing. The mutual exclusion
have a basic problem of busy waiting. If a process is unable to enter in to its critical
section; it tightly executes the loop of testing the shared global variable, wasting CPU
time, as well as resources. Semaphores avoid this wastage of time and resources by
blocking the process if it can not enter into its critical section. This process will be wake
up by the currently running process after coming out of critical section. Following
sections covers various aspects and issues related to concurrent transactions.
Objectives:
At the end of this unit you will be able to understand the:
• Brief introduction of Concurrency Control
• Conditions for Deadlocks
• Semaphores
What is concurrency?
“Concurrency occurs when two or more execution flows are able to run simultaneously.”
– Edsger Dijkstra.
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrent use of shared resources is the source of
many difficulties, such as race conditions (as explained below). The introduction of
mutual exclusion can prevent race conditions, but can lead to problems such as deadlock
and starvation.
In a single-processor multiprogramming system, processes must be interleaved in time to
yield the appearance of simultaneous execution. In a multiple-processor system, it is
possible not only to interleave the execution of multiple processes but also to overlap
them. Interleaving and overlapping can both be viewed as examples of concurrent
processing.
Concurrency control is a method used to ensure that processes are executed in a safe
manner (i.e., without affecting each other) and correct results are generated, while getting
those results as quickly as possible.
Race Conditions
A race condition occurs when multiple processes or threads read and write data items so
that the final result depends on the order of execution of instructions in the multiple
processes.
Suppose that two processes, P1 and P2, share the global variable A. At some point in its
execution, P1 updates variable A to the value 1, and at some point in its execution, P2
updates variable A to the value 2. Thus, the two processes are in a race to write variable
A. In this example the “loser” of the race (the process that updates last) determines the
final value of A.
Critical Section
A critical section is a part of program that accesses a shared resource (data structure or
device) that must not be concurrently accessed by more than one process of execution.
The key to preventing trouble involving shared storage is to find some way to prohibit
more than one process from reading and writing the shared data simultaneously. To avoid
race conditions and flawed results, one must identify the critical sections of code in each
process.
Mutual Exclusion
Mutual exclusion is a way of making sure that if one process is using a shared modifiable
data, the other processes will be excluded from doing the same thing.
That is, while one process is accessing the shared variable, all other processes desiring to
do so at the same moment should be kept waiting; when that process has finished using
the shared variable, one of the waiting processes should be allowed to proceed.
In this fashion, each process using the shared data (variables) excludes all others from
doing so simultaneously. This is called Mutual Exclusion.
Mutual exclusion needs to be enforced only when processes access shared modifiable
data – when processes are performing operations that do not conflict with one another
they should be allowed to proceed concurrently.
Requirements for mutual exclusion
Following are the six requirements for mutual exclusion.
1. Mutual exclusion must be enforced: Only one process at a time is allowed into its
critical section, among all processes that have critical sections for the same
resource or shared object.
2. A process that halts in its non critical section must do so without interfering with
other processes.
3. It must not be possible for a process requiring access to a critical section to be
delayed indefinitely.
4. When no process is in a critical section, any process that requests entry to its
critical section must be permitted to enter without delay.
5. No assumptions are made about relative process speed or number of
processors.
6. A process remains inside its critical section for a finite time only.
Following are some of the methods for achieving mutual exclusion.
Mutual exclusion by disabling interrupts:
In an interrupt driven system, context switches from one process to another can only
occur on interrupts (timer, I/O device, etc). If a process disables all interrupts then it
cannot be switched out.
On entry to the critical section the process can disable all interrupts, and on exit from it
can enable them again as shown bellow.
while (true)
{
/* disable interrupts */;
/* critical section */;
/* enable interrupts */;
/* remainder */;
}
Figure 7.1: Mutual exclusion by disabling interrupts
Because the critical section cannot be interrupted, mutual exclusion is guaranteed. But
since the processor cannot interleave processes while interrupts are disabled, system
performance is degraded. Also, this solution does not work on a multiprocessor system:
disabling interrupts on one processor does not prevent processes on the other processors
from running concurrently.
Mutual exclusion by using Lock variable:
In this method, we consider a single, shared, (lock) variable, initially 0. When a process
wants to enter in its critical section, it first tests the lock value. If lock is 0, the process
first sets it to 1 and then enters the critical section. If the lock is already 1, the process just
waits until the lock variable becomes 0. Thus, 0 means that no process is in its critical
section and 1 means that some process is in its critical section.
process (i)
{
while(lock != 0)
/* no operation */;
lock = 1;
/* critical section */;
lock = 0;
/* remainder */;
}
Figure 7.2: Mutual exclusion using lock variable
The flaw in this proposal can be best explained by example. Suppose process A sees that
the lock is 0. Before it can set the lock to 1 another process B is scheduled, runs, and sets
the lock to 1. When the process A runs again, it will also set the lock to 1, and two
processes will be in their critical section simultaneously. Thus this method does not
guarantee mutual exclusion.
Mutual exclusion by Strict Alternation:
In this method, the integer variable ‘turn’ keeps track of whose turn it is to enter the
critical section. Initially, process 0 inspects turn, finds it to be 0, and enters its critical
section. Process 1 also finds it to be 0 and sits in a loop, continually testing ‘turn’ to see
when it becomes 1. Process 0, after coming out of its critical section, sets turn to 1 to
allow process 1 to enter its critical section, as shown below.
/* Process 0 */
while (true)
{
while(turn != 0)
/* no operation */;
/* critical section */;
turn = 1;
/* remainder */;
}
/* Process 1 */
while (true)
{
while(turn != 1)
/* no operation */;
/* critical section */;
turn = 0;
/* remainder */;
}
Figure 7.3: Mutual exclusion by strict alternation
Taking turns is not a good idea when one of the processes is much slower than the other.
Suppose process 0 finishes its critical section quickly and wants to enter it again, but it
cannot do so because turn is set to 1. It has to wait for process 1 to finish its critical
section, even though both processes are in their non-critical sections. This situation
violates mutual exclusion requirement no. 4 above.
Mutual exclusion by Peterson’s Method:
The algorithm uses two variables, flag, a boolean array and turn, an integer. A true flag
value indicates that the process wants to enter the critical section. The variable turn holds
the id of the process whose turn it is. Entrance to the critical section is granted for process
P0 if P1 does not want to enter its critical section or if P1 has given priority to P0 by
setting turn to 0.
flag[0]=false;
flag[1]=false;
turn = 0;
/* Process 0 */
while (true)
{
flag[0] = true;
turn = 1;
while(flag[1] && turn == 1)
/* no operation */;
/* critical section */;
flag[0] = false;
/* remainder */;
}
/* Process 1 */
while (true)
{
flag[1] = true;
turn = 0;
while(flag[0] && turn == 0)
/* no operation */;
/* critical section */;
flag[1] = false;
/* remainder */;
}
Figure 7.4: Peterson’s algorithm
Mutual exclusion by using Special Machine Instructions:
In a multiprocessor environment, the processors share access to a common main memory
and at the hardware level, only one access to a memory location is permitted at a time.
With this as a foundation, processor designers provided machine instructions that carry
out two actions, such as a read and a write, on a single memory location in one atomic
step. Since processes interleave only at the instruction level, such special instructions are
not subject to interference from other processes. Two instructions of this kind are
discussed in the following parts.
Test and Set Instruction: The test and set instruction can be defined as follows:
boolean testset (int i)
{
    if (i == 0)
    {
        i = 1;
        return true;
    }
    else
    {
        return false;
    }
}
Figure 7.5: Test and Set Instruction
where the variable i is used like a traffic light. If it is 0, meaning green, the instruction
sets it to 1, i.e. red, and returns true; the current process is permitted to pass but the
others are told to stop. On the other hand, if the light is already red, the running process
receives false and realizes it is not supposed to proceed.
Exchange Instruction: The exchange instruction can be defined as follows:
void exchange (int register, int memory)
{
int temp;
temp = memory;
memory = register;
register = temp;
}
Figure 7.6: Exchange Instruction
The instruction exchanges the contents of a register with that of a memory location. A
shared variable bolt is initialized to 0. Each process uses a local variable key that is
initialized to 1, and executes the instruction as exchange(key, bolt). The only process
that may enter its critical section is the one that finds bolt equal to 0. It excludes all other
processes from the critical section by setting bolt to 1. When a process leaves its critical
section, it resets bolt to 0, allowing another process to gain access to its critical section.
Semaphores
All of the above methods of mutual exclusion share the basic problem of busy waiting. If
a process is unable to enter its critical section, it sits in a tight loop testing the shared
global variable, wasting CPU time as well as other resources. Semaphores avoid this
waste of time and resources by blocking a process that cannot enter its critical section;
the blocked process is later woken up by the currently running process when the latter
leaves its critical section.
What are Semaphores?
A semaphore is a mechanism that prevents two or more processes from accessing a
shared resource simultaneously. On the railroads a semaphore prevents two trains from
crashing on a shared section of track. On railroads and computers, semaphores are
advisory: if a train engineer doesn’t observe and obey it, the semaphore won’t prevent a
crash, and if a process doesn’t check a semaphore before accessing a shared resource,
chaos might result.
Semaphores can be thought of as flags (hence their name, semaphores). They are either
on or off. A process can turn on the flag or turn it off. If the flag is already on, processes
that try to turn on the flag will sleep until the flag is off. Upon awakening, the process
will reattempt to turn the flag on, possibly succeeding or possibly sleeping again. Such
behavior allows semaphores to be used in implementing a post-wait driver – a system
where processes can wait for events (i.e., wait on turning on a semaphore) and post
events (i.e. turning off of a semaphore).
Dijkstra in 1965 proposed semaphores as a solution to the problems of concurrent
processes. The fundamental principle is that two or more processes can cooperate by
means of simple signals, such that a process can be forced to stop at a specified place
until it has received a specific signal.
For signaling, special variables called semaphores are used.
Primitive signal (s) is used to transmit a signal
Primitive wait (s) is used to receive a signal
Semaphore Implementation:
To achieve desired effect, view semaphores as variables that have an integer value upon
which three operations are defined:
• A semaphore may be initialized to a non-negative value
• The wait operation decrements the semaphore value. If the value becomes
negative, the process executing the wait is blocked
• The signal operation increments the semaphore value. If the resulting value is not
positive, then a process blocked by a wait operation is unblocked.
There is no other way to manipulate semaphores.
wait (S)
{
    while (S <= 0)
        ; /* no operation */
    S--;
}

signal (S)
{
    S++;
}
Figure 7.7: Semaphore operations
Mutual Exclusion using Semaphore:
The following example illustrates mutual exclusion using semaphore:
A process before entering in to its critical section, performs wait(mutex) operating and
after coming out of critical section, signal(mutex) operation; thus achieving mutual
exclusion.
Shared data:
semaphore mutex; //initially mutex = 1
Process: Pi:
do
{
wait(mutex);
/* critical section */
signal(mutex);
/* remainder section */
} while (1);
Figure 7.8: Mutual exclusion using semaphore
Following code gives the detailed implementation of wait and signal procedures for
above example. The structure definition has semaphore value and process link. The wait
operation decrements the semaphore value, and if it is less than 0 then adds it to waiting
queue and blocks the process.
Declaration:
typedef struct
{
int value;
struct process *L;
} semaphore;
wait(S):
{
S.value--;
if (S.value < 0)
{
add this process to S.L;
block;
}
}
signal(S):
{
S.value++;
if (S.value <= 0)
{
remove a process P from S.L;
wakeup(P);
}
}
Figure 7.9: wait() and signal() for mutual exclusion
The process currently in its critical section increments the semaphore value after
coming out, and checks whether the value is less than or equal to 0. If so, it removes a
process from the waiting queue and wakes it up.
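In user space, the block/wakeup behaviour of Figure 7.9 can be approximated with a pthread mutex and condition variable: the condition variable's internal wait queue plays the role of S.L, pthread_cond_wait() is "block", and pthread_cond_signal() is "wakeup(P)". This is a sketch, not the kernel implementation; note that here the value never goes negative, and a while loop re-tests the condition after each wakeup.

```c
#include <pthread.h>

/* Blocking counting semaphore modelled on Figure 7.9. */
typedef struct {
    int value;
    pthread_mutex_t lock;      /* protects value */
    pthread_cond_t  nonzero;   /* its wait queue stands in for S.L */
} blk_sem;

void blk_sem_init(blk_sem *s, int initial) {
    s->value = initial;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
}

void blk_sem_wait(blk_sem *s) {
    pthread_mutex_lock(&s->lock);
    while (s->value <= 0)                  /* nothing available: block */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->value--;                            /* claim one unit */
    pthread_mutex_unlock(&s->lock);
}

void blk_sem_signal(blk_sem *s) {
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->nonzero);      /* wakeup(P) */
    pthread_mutex_unlock(&s->lock);
}
```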
Summary
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrency control is a method used to ensure that
processes are executed in a safe manner (i.e., without affecting each other) and correct
results are generated, while getting those results as quickly as possible. A race condition
occurs when multiple processes or threads read and write data items so that the final
result depends on the order of execution of instructions in the multiple processes.
Mutual exclusion is a way of making sure that if one process is using a shared modifiable
data, the other processes will be excluded from doing the same thing. Mutual exclusion
can be achieved by various ways such as using lock variable, by strict alternation, by
disabling interrupts, using Peterson’s method, through special machine instructions, and
Semaphores.
Terminal Questions
1. What is concurrency?
2. Discuss the problems caused by concurrent executions of processes.
3. What is race condition?
4. Describe critical section.
5. What is mutual exclusion? What are its requirements?
6. Explain any one method for achieving mutual exclusion.
7. Explain the Peterson’s solution for mutual exclusion.
8. What are special machine instructions? How do they support mutual exclusion?
9. What are Semaphores? How can we achieve mutual exclusion using Semaphores?
Unit 8 : File Systems and Space Management :
This unit covers file management: file structure, implementing file systems, and space
management – block size and extents, free space, reliability, bad blocks and backup
dumps. Consistency checking, transactions and performance are also discussed in brief.
Introduction
Most operating systems provide a file system, as a file system is an integral part of any
modern operating system. Early microcomputer operating systems’ only real task was file
management – a fact reflected in their names. Some early operating systems had a
separate component for handling file systems which was called a disk operating system.
On some microcomputers, the disk operating system was loaded separately from the rest
of the operating system. On early operating systems, there was usually support for only
one, native, unnamed file system; for example, CP/M supports only its own file system,
which might be called “CP/M file system” if needed, but which didn’t bear any official
name at all. Because of this, there needs to be an interface provided by the operating
system software between the user and the file system. This interface can be textual (such
as provided by a command line interface, such as the UNIX shell, or OpenVMS DCL) or
graphical (such as provided by a graphical user interface, such as file browsers). If
graphical, the metaphor of the folder, containing documents, other files, and nested
folders is often used. This unit covers various issues related to Files.
Objectives:
At the end of this unit you will understand:
• Brief introduction of File Systems and Structures and their implementation
• Storage and Space management with consistency checking, Performance
evaluation and transaction related issues
• Fundamental understanding of Access Methods
File Systems
Just as the process abstraction beautifies the hardware by making a single CPU (or a
small number of CPUs) appear to be many CPUs, one per “user,” the file system
beautifies the hardware disk, making it appear to be a large number of disk-like objects
called files. Like a disk, a file is capable of storing a large amount of data cheaply,
reliably, and persistently. The fact that there are lots of files is one form of beautification:
Each file is individually protected, so each user can have his own files, without the
expense of requiring each user to buy his own disk. Each user can have lots of files,
which makes it easier to organize persistent data. The file system also makes each
individual file more beautiful than a real disk. At the very least, it erases block
boundaries, so a file can be any length (not just a multiple of the block size) and
programs can read and write arbitrary regions of the file without worrying about whether
they cross block boundaries. Some systems (not Unix) also provide assistance in
organizing the contents of a file.
Systems use the same sort of device (a disk drive) to support both virtual memory and
files. The question arises why these have to be distinct facilities, with vastly different user
interfaces. The answer is that they don’t. In Multics, there was no difference whatsoever.
Everything in Multics was a segment. The address space of each running process
consisted of a set of segments (each with its own segment number), and the “file system”
was simply a set of named segments. To access a segment from the file system, a process
would pass its name to a system call that assigned a segment number to it. From then on,
the process could read and write the segment simply by executing ordinary loads and
stores. For example, if the segment was an array of integers, the program could access the
ith number with a notation like a[i] rather than having to seek to the appropriate offset and
then execute a read system call. If the block of the file containing this value wasn’t in
memory, the array access would cause a page fault, which was serviced.
This user-interface idea, sometimes called “single-level store,” is a great idea. So why is
it not common in current operating systems? In other words, why are virtual memory and
files presented as very different kinds of objects? There are several possible explanations
one might propose:
The address space of a process is small compared to the size of a file system.
There is no reason why this has to be so. In Multics, a process could have up to 256K
segments, but each segment was limited to 64K words. Multics allowed for lots of
segments because every “file” in the file system was a segment. The upper bound of 64K
words per segment was considered large by the standards of the time; the hardware
actually allowed segments of up to 256K words (over one megabyte). Most new
processors introduced in the last few years allow 64-bit virtual addresses. In a few years,
such processors will dominate. So there is no reason why the virtual address space of a
process cannot be large enough to include the entire file system.
The virtual memory of a process is transient – it goes away when the process
terminates – while files must be persistent.
Multics showed that this doesn’t have to be true. A segment can be designated as
“permanent,” meaning that it should be preserved after the process that created it
terminates. Permanent segments do raise a need for one “file-system-like” facility: the
ability to give names to segments so that new processes can find them.
Files are shared by multiple processes, while the virtual address space of a process is
associated with only that process.
Most modern operating systems (including most variants of Unix) provide some way for
processes to share portions of their address spaces anyhow, so this is a particularly weak
argument for a distinction between files and segments.
The real reason single-level store is not ubiquitous is probably a concern for efficiency.
The usual file-system interface encourages a particular style of access: Open a file, go
through it sequentially, copying big chunks of it to or from main memory, and then close
it. While it is possible to access a file like an array of bytes, jumping around and
accessing the data in tiny pieces, it is awkward. Operating system designers have found
ways to implement files that make the common “file like” style of access very efficient.
While there appears to be no reason in principle why memory-mapped files cannot be
made to give similar performance when they are accessed in this way, in practice, the
added functionality of mapped files always seems to pay a price in performance. Besides,
if it is easy to jump around in a file, applications programmers will take advantage of it,
overall performance will suffer, and the file system will be blamed.
Naming
Every file system provides some way to give a name to each file. We will consider only
names for individual files here, and talk about directories later. The name of a file is (at
least sometimes) meant to be used by human beings, so it should be easy for humans to use.
Different operating systems put different restrictions on names:
Size
Some systems put severe restrictions on the length of names. For example DOS restricts
names to 11 characters, while early versions of Unix (and some still in use today) restrict
names to 14 characters. The Macintosh operating system, Windows 95, and most modern
version of Unix allow names to be essentially arbitrarily long. I say “essentially” since
names are meant to be used by humans, so they don’t really need to be all that long. A name
that is 100 characters long is just as difficult to use as one that is forced to be under 11
characters long (but for different reasons). Most modern versions of Unix, for example,
restrict names to a limit of 255 characters.
Case
Are upper and lower case letters considered different? The Unix tradition is to consider
the names FILE1 and file1 to be completely different and unrelated names. In DOS and
its descendants, however, they are considered the same. Some systems translate names to
one case (usually upper case) for storage. Others retain the original case, but consider it
simply a matter of decoration. For example, if you create a file named “Fil1,” you could
open it as “FILE1” or “fil1,” but if you list the directory, you would still see the file
listed as “Fil1”.
Character Set
Different systems put different restrictions on what characters can appear in file names.
The Unix directory structure supports names containing any character other than NUL
(the byte consisting of all zero bits), but many utility programs (such as the shell) would
have troubles with names that have spaces, control characters or certain punctuation
characters (particularly ‘/’). MacOS allows all of these (e.g., it is not uncommon to see a
file name with the Copyright symbol © in it). With the world-wide spread of computer
technology, it is becoming increasingly important to support languages other than
English, and in fact alphabets other than Latin. There is a move to support character
strings (and in particular file names) in the Unicode character set, which devotes 16 bits
to each character rather than 8 and can represent the alphabets of all major modern
languages from Arabic to Devanagari to Telugu to Khmer.
Format
It is common to divide a file name into a base name and an extension that indicates the
type of the file. DOS requires that each name be composed of a base name of eight or fewer
characters and an extension of three or fewer characters. When the name is displayed, it is
represented as base.extension. Unix internally makes no such distinction, but it is a
common convention to include exactly one period in a file name (e.g. fil.c for a C source
file).
File Structure
Unix hides the “chunkiness” of tracks, sectors, etc. and presents each file as a “smooth”
array of bytes with no internal structure. Application programs can, if they wish, use the
bytes in the file to represent structures. For example, a wide-spread convention in Unix is
to use the newline character (the character with bit pattern 00001010) to break text files
into lines. Some other systems provide a variety of other types of files. The most
common are files that consist of an array of fixed or variable size records and files that
form an index mapping keys to values. Indexed files are usually implemented as B-trees.
File Types
Most systems divide files into various “types.” The concept of “type” is a confusing one,
partially because the term “type” can mean different things in different contexts. Unix
initially supported only four types of files: directories, two kinds of special files, and
“regular” files. Just about any type of file is considered a “regular” file by Unix. Within
this category, however, it is useful to distinguish text files from binary files; within binary
files there are executable files (which contain machine-language code) and data files; text
files might be source files in a particular programming language (e.g. C or Java) or they
may be human-readable text in some mark-up language such as html (hypertext markup
language). Data files may be classified according to the program that created them or is
able to interpret them, e.g., a file may be a Microsoft Word document or Excel
spreadsheet or the output of TeX. The possibilities are endless.
In general (not just in Unix) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately
from the file, but associated with it. Unix only provides enough meta-data to
distinguish a regular file from a directory (or special file), but other systems
support more types.
2. The type of a file may be indicated by part of its contents, such as a header made
up of the first few bytes of the file. In Unix, files that store executable programs
start with a two byte magic number that identifies them as executable and selects
one of a variety of executable formats. In the original Unix executable format,
called the a.out format, the magic number is the octal number 0407, which
happens to be the machine code for a branch instruction on the PDP-11 computer,
one of the first computers to implement Unix. The operating system could run a
file by loading it into memory and jumping to the beginning of it. The 0407 code,
interpreted as an instruction, jumps to the word following the 16-byte header,
which is the beginning of the executable code in this format. The PDP-11
computer is extinct by now, but it lives on through the 0407 code!
3. The type of a file may be indicated by its name. Sometimes this is just a
convention, and sometimes it’s enforced by the OS or by certain programs. For
example, the Unix Java compiler refuses to believe that a file contains Java source
unless its name ends with .java.
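Approach 2 above - detecting type from a header - can be sketched in a few lines of C. Modern Unix executables use the ELF format, which really does begin with the four bytes 0x7f 'E' 'L' 'F', much as a.out files began with octal 0407; the helper name below is made up for the illustration.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative type detection by magic number: does the file
   start with the documented ELF magic bytes 0x7f 'E' 'L' 'F'? */
int looks_like_elf(const char *path) {
    unsigned char magic[4];
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(magic, 1, 4, f);  /* read just the header */
    fclose(f);
    return n == 4 && memcmp(magic, "\x7f" "ELF", 4) == 0;
}
```

The same pattern works for any header-based format: read the first few bytes and compare against the known magic value.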
Some systems enforce the types of files more vigorously than others. File types may be
enforced
• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.
Unix tends to be very lax in enforcing types.
Access Modes
Systems support various access modes for operations on a file.
• Sequential. Read or write the next record or next n bytes of the file. Usually,
sequential access also allows a rewind operation.
• Random. Read or write the nth record or bytes i through j. Unix provides an
equivalent facility by adding a seek operation to the sequential operations listed
above. This packaging of operations allows random access but encourages
sequential access.
• Indexed. Read or write the record with a given key. In some cases, the “key”
need not be unique – there can be more than one record with the same key. In this
case, programs use a combination of indexed and sequential operations: Get the
first record with a given key, then get other records with the same key by doing
sequential reads.
Note that access modes are distinct from file structure – e.g., a record-structured file
can be accessed either sequentially or randomly – but the two concepts are not entirely
unrelated. For example, indexed access mode only makes sense for indexed files.
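The Unix packaging of random access - a seek plus the ordinary sequential operations - looks like the following sketch; the file name and helper name are arbitrary.

```c
#include <stdio.h>

/* Unix-style random access: a seek repositions the file offset,
   after which a plain sequential read returns the nth byte. */
int read_nth_byte(const char *path, long n) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int c = -1;
    if (fseek(f, n, SEEK_SET) == 0)   /* jump straight to offset n */
        c = fgetc(f);                 /* then read sequentially */
    fclose(f);
    return c;
}
```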
File Attributes
This is the area where there is the most variation among file systems. Attributes can also
be grouped by general category.
Name
Ownership and Protection
Owner, owner’s “group,” creator, access-control list (information about who can do what
to this file, for example, perhaps the owner can read or modify it, other members of his
group can only read it, and others have no access).
Time Stamps
Time created, time last modified, time last accessed, time the attributes were last
changed, etc. Unix maintains the last three of these. Some systems record not only when
the file was last modified, but by whom.
Sizes
Current size, size limit, “high-water mark”, space consumed (which may be larger than
size because of internal fragmentation or smaller because of various compression
techniques).
Type Information
As described above: File is ASCII, is executable, is a “system” file, is an Excel spread
sheet, etc.
Misc
Some systems have attributes describing how the file should be displayed when a directory
is listed. For example MacOS records an icon to represent the file and the screen
coordinates where it was last displayed. DOS has a “hidden” attribute meaning that the
file is not normally shown. Unix achieves a similar effect by convention: The ls program
that is usually used to list files does not show files with names that start with a period
unless you explicitly request it to (with the -a option).
Unix records a fixed set of attributes in the meta-data associated with a file. If you want
to record some fact about the file that is not included among the supported attributes, you
have to use one of the tricks listed above for recording type information: encode it in the
name of the file, put it into the body of the file itself, or store it in a file with a related
name. Other systems (notably MacOS and Windows NT) allow new attributes to be
invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-
name, attribute-value) pairs. The attribute name can be any four-character string, and the
attribute value can be anything at all. Indeed, some kinds of files put the entire “contents”
of the file in an attribute and leave the “body” of the file (called the data fork) empty.
Self Assessment Questions
1. Discuss the three ways of indicating the type of files.
2. Explain the various types of file access modes.
3. Explain the file system attributes in brief.
Implementing File Systems
Files
We will assume that all the blocks of the disk are given block numbers starting at zero
and running through consecutive integers up to some maximum. We will further assume
that blocks with numbers that are near each other are located physically near each other
on the disk (e.g., same cylinder) so that the arithmetic difference between the numbers of
two blocks gives a good estimate of how long it takes to get from one to the other. First let’s
consider how to represent an individual file. There are (at least!) four possibilities:
Contiguous
The blocks of a file are the block numbered n, n+1, n+2, …, m. We can represent any file
with a pair of numbers: the block number of the first block and the length of the file (in
blocks). The advantages of this approach are
• It’s simple
• The blocks of the file are all physically near each other on the disk and in order so
that a sequential scan through the file will be fast.
The problem with this organization is that you can only grow a file if the block following
the last block in the file happens to be free. Otherwise, you would have to find a long
enough run of free blocks to accommodate the new length of the file and copy it. As a
practical matter, operating systems that use this organization require the maximum size of
the file to be declared when it is created and pre-allocate space for the whole file. Even
then, storage allocation has all the problems we considered when studying main-memory
allocation including external fragmentation.
Linked List
A file is represented by the block number of its first block, and each block contains the
block number of the next block of the file. This representation avoids the problems of the
contiguous representation: We can grow a file by linking any disk block onto the end of
the list, and there is no external fragmentation. However, it introduces a new problem:
Random access is effectively impossible. To find the 100th block of a file, we have to
read the first 99 blocks just to follow the list. We also lose the advantage of very fast
sequential access to the file since its blocks may be scattered all over the disk. However,
if we are careful when choosing blocks to add to a file, we can retain pretty good
sequential access performance.
Both the space overhead (the percentage of the space taken up by pointers) and the time
overhead (the percentage of the time seeking from one place to another) can be decreased
by using larger blocks. The hardware designer fixes the block size (which is usually quite
small) but the software can get around this problem by using “virtual” blocks, sometimes
called clusters. The OS simply treats each group of (say) four contiguous physical disk
sectors as one cluster. Large clusters, particularly if they can be of variable size, are
sometimes called extents. Extents can be thought of as a compromise between linked and
contiguous allocation.
Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of the
blocks and gather them together all in one place. This approach is used in the “FAT” file
system of DOS, OS/2 and older versions of Windows. At some fixed place on disk,
allocate an array I with one element for each block on the disk, and move the link field
from block n to I[n]. The whole array of links, called a file allocation table (FAT), is now
small enough that it can be read into main memory when the systems starts up. Accessing
the 100th block of a file still requires walking through 99 links of a linked list, but now
the entire list is in memory, so time to traverse it is negligible (recall that a single disk
access takes as long as 10’s or even 100’s of thousands of instructions). This
representation has the added advantage of getting the “operating system” stuff (the links)
out of the pages of “user data”. The pages of user data are now full-size disk blocks, and
lots of algorithms work better with chunks that are a power of two bytes long. Also, it
means that the OS can prevent users (who are notorious for screwing things up) from
getting their grubby hands on the system data.
The main problem with this approach is that the index array I can get quite large with
modern disks. For example, consider a 2 GB disk with 2K blocks. There are a million
blocks, so a block number must be at least 20 bits. Rounded up to an even number of
bytes, that’s 3 bytes–4 bytes if we round up to a word boundary–so the array I is three or
four megabytes. While that’s not an excessive amount of memory given today’s RAM
prices, if we can get along with less, there are better uses for the memory.
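The two calculations in this section - chasing links in an in-memory FAT, and the size of the table I - can be sketched as follows. The geometry matches the example above (2 GB disk, 2K blocks, so 2^20 blocks); the names are illustrative.

```c
/* Sketch of an in-memory FAT: fat[n] holds the number of the block
   that follows block n in its file, with -1 marking end-of-file. */
enum { BLOCK_SIZE = 2048, NBLOCKS = 1 << 20 };   /* 2 GB / 2K blocks */

/* Follow the chain in memory to find the kth block of a file
   (k = 0 is the first block). One link per step, but no disk I/O. */
long nth_block(const long *fat, long first_block, int k) {
    long b = first_block;
    while (k-- > 0 && b != -1)
        b = fat[b];
    return b;
}

/* Size of the whole table I for a given entry width. */
long fat_bytes(int bytes_per_entry) {
    return (long)NBLOCKS * bytes_per_entry;
}
```

With 4-byte entries the table is 4 MB, matching the "three or four megabytes" estimate above.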
File Index
Although a typical disk may contain tens of thousands of files, only a few of them are
open at any one time, and it is only necessary to keep index information about open files
in memory to get good performance. Unfortunately the whole-disk index described in the
previous paragraph mixes index information about all files for the whole disk together,
making it difficult to cache only information about open files. The inode structure
introduced by Unix groups together index information about each file individually. The
basic idea is to represent each file as a tree of blocks, with the data blocks as leaves. Each
internal block (called an indirect block in Unix jargon) is an array of block numbers,
listing its children in order. If a disk block is 2K bytes and a block number is four bytes,
512 block numbers fit in a block, so a one-level tree (a single root node pointing directly
to the leaves) can accommodate files up to 512 blocks, or one megabyte in size. If the
root node is cached in memory, the “address” (block number) of any block of the file can
be found without any disk accesses. A two-level tree, with 513 total indirect blocks, can
handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more
than one block needs at least one indirect block to store its block numbers. A 4K file
would require three 2K blocks, wasting up to one third of its space. Since many files are
quite small, this is a serious problem. The Unix solution is to use a different kind of
“block” for the root of the tree.
An index node (or inode for short) contains almost all the meta-data about a file listed
above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are small
enough that several of them can be packed into one disk block. In addition to the meta-
data, an inode contains the block numbers of the first few blocks of the file. What if the
file is too big to fit all its block numbers into the inode? The earliest version of Unix had
a bit in the meta-data to indicate whether the file was “small” or “big.” For a big file, the
inode contained the block numbers of indirect blocks rather than data blocks. More recent
versions of Unix contain pointers to indirect blocks in addition to the pointers to the first
few data blocks. The inode contains pointers to (i.e., block numbers of) the first few
blocks of the file, a pointer to an indirect block containing pointers to the next several
blocks of the file, a pointer to a doubly indirect block, which is the root of a two-level
tree whose leaves are the next blocks of the file, and a pointer to a triply indirect block. A
large file is thus a lop-sided tree.
A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are four
bytes and the size of a block is a parameter stored in the file system itself, typically 8K
(8192 bytes), so 2048 pointers fit in one block. An inode has direct pointers to the first 12
blocks of the file, as well as pointers to singly, doubly, and triply indirect blocks. A file
of up to 12+2048+2048*2048 = 4,196,364 blocks or 34,376,613,888 bytes (about 32 GB)
can be represented without using triply indirect blocks, and with the triply indirect
block, the maximum file size is (12+2048+2048*2048+2048*2048*2048)*8192 =
70,403,120,791,552 bytes (slightly more than 2^46 bytes, or about 64 terabytes). Of
course, for such huge files, the size of the
file cannot be represented as a 32-bit integer. Modern versions of Unix store the file
length as a 64-bit integer, called a “long” integer in Java. An inode is 128 bytes long,
allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in one disk
block. Since the inode for a file is kept in memory while the file is open, locating an
arbitrary block of any file requires at most three I/O operations, not counting the
operation to read or write the data block itself.
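The Solaris arithmetic above is easy to check in code. The constants come straight from the text: 8K blocks, 4-byte block numbers (so 2048 pointers per block), and 12 direct pointers in the inode.

```c
#include <stdint.h>

/* Solaris 2.5 inode capacity, per the figures in the text. */
enum { BLK = 8192, NPTRS = 8192 / 4, NDIRECT = 12 };  /* 2048 ptrs/block */

/* Maximum file size in blocks without the triply indirect block:
   direct + singly indirect + doubly indirect. */
uint64_t max_blocks_double(void) {
    return NDIRECT + (uint64_t)NPTRS + (uint64_t)NPTRS * NPTRS;
}

/* Maximum file size in bytes with the triply indirect block. */
uint64_t max_bytes_triple(void) {
    uint64_t blocks = max_blocks_double() + (uint64_t)NPTRS * NPTRS * NPTRS;
    return blocks * BLK;
}
```

Evaluating these reproduces the 4,196,364-block (about 32 GB) and 70,403,120,791,552-byte (about 64 TB) figures quoted above, which is also why the file length must be a 64-bit integer.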
Directories
A directory is simply a table mapping character-string human-readable names to
information about files. The early PC operating system CP/M shows how simple a
directory can be. Each entry contains the name of one file, its owner, size (in blocks) and
the block numbers of 16 blocks of the file. To represent files with more than 16 blocks,
CP/M used multiple directory entries with the same name and different values in a field
called the extent number. CP/M had only one directory for the entire system.
DOS uses a similar directory entry format, but stores only the first block number of the
file in the directory entry. The entire file is represented as a linked list of blocks using the
disk index scheme described above. All but the earliest version of DOS provide
hierarchical directories using a scheme similar to the one used in Unix.
Unix has an even simpler directory format. A directory entry contains only two fields: a
character-string name (up to 14 characters) and a two-byte integer called an inumber,
which is interpreted as an index into an array of inodes in a fixed, known location on
disk. All the remaining information about the file (size, ownership, time stamps,
permissions, and an index to the blocks of the file) are stored in the inode rather than the
directory entry. A directory is represented like any other file (there’s a bit in the inode to
indicate that the file is a directory). Thus the inumber in a directory entry may designate a
“regular” file or another directory, allowing arbitrary graphs of nodes. However, Unix
carefully limits the set of operating system calls to ensure that the set of directories is
always a tree. The root of the tree is the file with inumber 1 (some versions of Unix use
other conventions for designating the root directory). The entries in each directory point
to its children in the tree. For convenience, each directory also has two special entries: an
entry with name “..”, which points to the parent of the directory in the tree and an entry
with name “.”, which points to the directory itself. Inumber 0 is not used, so an entry is
marked “unused” by setting its inumber field to 0.
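The classic Unix directory entry described above - a two-byte inumber followed by a 14-character name - can be written down as a C struct; the struct name is made up, but the layout follows the text.

```c
#include <stdint.h>

/* Classic (V7-era) Unix directory entry: 16 bytes in all. */
struct v7_dirent {
    uint16_t d_ino;        /* index into the inode array; 0 = unused */
    char     d_name[14];   /* not NUL-terminated when all 14 chars used */
};

/* An entry is free when its inumber field is 0, as the text notes. */
int entry_is_free(const struct v7_dirent *e) {
    return e->d_ino == 0;
}
```

Because a directory is stored like any other file, reading it is just reading an array of these fixed-size records.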
Self Assessment Questions
1. What is a block? Write its advantages.
2. Explain the disk index scheme and its advantages.
3. Explain the Unix directory format with a suitable example.
Space Management
Block Size and Extents
All of the file organizations I’ve mentioned store the contents of a file in a set of disk
blocks. How big should a block be? The problem with small blocks is I/O overhead.
There is a certain overhead to read or write a block beyond the time to actually transfer
the bytes. If we double the block size, a typical file will have half as many blocks.
Reading or writing the whole file will transfer the same amount of data, but it will
involve half as many disk I/O operations. The overhead for an I/O operations includes a
variable amount of latency (seek time and rotational delay) that depends on how close the
blocks are to each other, as well as a fixed overhead to start each operation and respond
to the interrupt when it completes.
Many years ago, researchers at the University of California at Berkeley studied the
original Unix file system. They found that when they tried reading or writing a single
very large file sequentially, they were getting only about 2% of the potential speed of the
disk. In other words, it took about 50 times as long to read the whole file as it would if
they simply read that many sequential blocks directly from the raw disk (with no file
system software). They tried doubling the block size (from 512 bytes to 1K) and the
performance more than doubled. The reason the speed more than doubled was that it took
less than half as many I/O operations to read the file. Because the blocks were twice as
large, twice as much of the file’s data was in blocks pointed to directly by the inode.
Indirect blocks were twice as large as well, so they could hold twice as many pointers.
Thus four times as much data could be accessed through the singly indirect block without
resorting to the doubly indirect block.
If doubling the block size more than doubled performance, why stop there? Why didn’t
the Berkeley folks make the blocks even bigger? The problem with big blocks is internal
fragmentation. A file can only grow in increments of whole blocks. If the sizes of files
are random, we would expect on the average that half of the last block of a file is wasted.
If most files are many blocks long, the relative amount of waste is small, but if the block
size is large compared to the size of a typical file, half a block per file is significant. In
fact, if files are very small (compared to the block size), the problem is even worse. If, for
example, we choose a block size of 8k and the average file is only 1K bytes long, we
would be wasting about 7/8 of the disk.
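The internal-fragmentation arithmetic in this paragraph amounts to one formula: the tail of a file's last block is lost.

```c
/* Bytes wasted by internal fragmentation when a file of the given
   size is stored in whole blocks of the given block size. */
long waste_bytes(long file_size, long block_size) {
    long tail = file_size % block_size;
    return tail == 0 ? 0 : block_size - tail;   /* unused tail of last block */
}
```

For the example above, a 1K file in an 8K block wastes 7168 bytes, i.e. 7/8 of the block.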
Most files in a typical Unix system are very small. The Berkeley researchers made a list
of the sizes of all files on a typical disk and did some calculations of how much space
would be wasted by various block sizes. Simply rounding the size of each file up to a
multiple of 512 bytes resulted in wasting 4.2% of the space. Including overhead for
inodes and indirect blocks, the original 512-byte file system had a total space overhead of
6.9%. Changing to 1K blocks raised the overhead to 11.8%. With 2k blocks, the overhead
would be 22.4% and with 4k blocks it would be 45.6%. Would 4k blocks be worthwhile?
The answer depends on economics. In those days disks were very expensive, and
wasting half the disk seemed extreme. These days, disks are cheap, and for many
applications people would be happy to pay twice as much per byte of disk space to get a
disk that was twice as fast.
But there’s more to the story. The Berkeley researchers came up with the idea of breaking
up the disk into blocks and fragments. For example, they might use a block size of 2k and
a fragment size of 512 bytes. Each file is stored in some number of whole blocks plus 0
to 3 fragments at the end. The fragments at the end of one file can share a block with
fragments of other files. The problem is that when we want to append to a file, there may
not be any space left in the block that holds its last fragment. In that case, the Berkeley
file system copies the fragments to a new (empty) block. A file that grows a little at a
time may require each of its fragments to be copied many times. They got around this
problem by modifying application programs to buffer their data internally and add it to a
file a whole block’s worth at a time. In fact, most programs already used library routines
to buffer their output (to cut down on the number of system calls), so all they had to do
was to modify those library routines to use a larger buffer size. This approach has been
adopted by many modern variants of Unix. The Solaris system you are using for this
course uses 8k blocks and 1K fragments.
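The block-and-fragment arithmetic can be sketched as follows; this is only an illustration of the layout rule, not the actual Berkeley code (2 KB blocks and 512-byte fragments, as in the example above):

```python
# Layout rule: a file occupies whole blocks plus 0 to (block//frag - 1)
# trailing fragments; a full block's worth of fragments is just a block.
def layout(size, block=2048, frag=512):
    whole, tail = divmod(size, block)
    frags = -(-tail // frag)          # round the tail up to whole fragments
    if frags == block // frag:
        whole, frags = whole + 1, 0
    return whole, frags

print(layout(2500))   # (1, 1): one 2 KB block plus one 512 B fragment
```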
As disks get cheaper and CPU’s get faster, wasted space is less of a problem and the
speed mismatch between the CPU and the disk gets worse. Thus the trend is towards
larger and larger disk blocks.
At first glance it would appear that the OS designer has no say in how big a block is. Any
particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use
larger “blocks”. For example, if we think it would be a good idea to use 2K blocks, we
can group together each run of four consecutive sectors and call it a block. In fact, it
would even be possible to use variable-sized “blocks,” so long as each one is a multiple
of the sector size. A variable-sized “block” is called an extent. When extents are used,
they are usually used in addition to multi-sector blocks. For example, a system may use
2k blocks, each consisting of 4 consecutive sectors, and then group them into extents of 1
to 10 blocks. When a file is opened for writing, it grows by adding an extent at a time.
When it is closed, the unused blocks at the end of the last extent are returned to the
system. The problem with extents is that they introduce all the problems of external
fragmentation that we saw in the context of main memory allocation. Extents are
generally only used in systems such as databases, where high-speed access to very large
files is important.
Free Space
We have seen how to keep track of the blocks in each file. How do we keep track of the
free blocks – blocks that are not in any file? There are two basic approaches.
• Use a bit vector. That is simply an array of bits with one bit for each block on the
disk. A 1 bit indicates that the corresponding block is allocated (in some file) and
a 0 bit says that it is free. To allocate a block, search the bit vector for a zero bit,
and set it to one.
• Use a free list. The simplest approach is simply to link together the free blocks by
storing the block number of each free block in the previous free block. The
problem with this approach is that when a block on the free list is allocated, you
have to read it into memory to get the block number of the next block in the list.
This problem can be solved by storing the block numbers of additional free blocks
in each block on the list. In other words, the free blocks are stored in a sort of
lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of
the free blocks would be linked into a list. Each block on the list would contain a
pointer to the next block on the list, as well as pointers to 127 additional free
blocks. When the first block of the list is allocated to a file, it has to be read into
memory to get the block numbers stored in it, but then we can allocate 127 more
blocks without reading any of them from disk. Freeing blocks is done by running
this algorithm in reverse: Keep a cache of 127 block numbers in memory. When a
block is freed, add its block number to this cache. If the cache is full when a block
is freed, use the block being freed to hold all the block numbers in the cache and
link it to the head of the free list by adding to it the block number of the previous
head of the list.
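A minimal sketch of the bitmap approach (a toy in-memory version; a real file system keeps the bitmap on disk and searches it more cleverly):

```python
# 1 = allocated (in some file), 0 = free; one bit per disk block.
class Bitmap:
    def __init__(self, nblocks):
        self.bits = [0] * nblocks
    def alloc(self):
        for i, bit in enumerate(self.bits):   # search for a zero bit
            if bit == 0:
                self.bits[i] = 1              # ...and set it to one
                return i
        raise RuntimeError("disk full")
    def free(self, i):
        self.bits[i] = 0

bm = Bitmap(8)
a, b = bm.alloc(), bm.alloc()   # blocks 0 and 1
bm.free(a)
print(bm.alloc())               # 0: the lowest-numbered free block again
```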
How do these methods compare? Neither requires significant space overhead on disk.
The bitmap approach needs one bit for each block. Even for a tiny block size of 512
bytes, each bit of the bitmap describes 512*8 = 4096 bits of free space, so the overhead is
less than 1/40 of 1%. The free list is even better. All the pointers are stored in blocks that
are free anyhow, so there is no space overhead (except for one pointer to the head of the
list). Another way of looking at this is that when the disk is full (which is the only time
we should be worried about space overhead!) the free list is empty, so it takes up no
space. The real advantage of bitmaps over free lists is that they give the space allocator
more control over which block is allocated to which file. Since the blocks of a file are
generally accessed together, we would like them to be near each other on disk. To ensure
this clustering, when we add a block to a file we would like to choose a free block that is
near the other blocks of a file. With a bitmap, we can search the bitmap for an appropriate
block. With a free list, we would have to search the free list on disk, which is clearly
impractical. Of course, to search the bitmap, we have to have it all in memory, but since
the bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the
entire bitmap in memory all the time. To do the comparable operation with a free list, we
would need to keep the block numbers of all free blocks in memory. If a block number is
four bytes (32 bits), that means that 32 times as much memory would be needed for the
free list as for a bitmap. For a concrete example, consider a 2 gigabyte disk with 8K
blocks and 4-byte block numbers. The disk contains 2^31/2^13 = 2^18 = 262,144 blocks. If
they are all free, the free list has 262,144 entries, so it would take one megabyte of
memory to keep them all in memory at once. By contrast, a bitmap requires 2^18 bits, or
2^15 = 32K bytes (just four blocks). (On the other hand, the bitmap takes the same amount
of memory regardless of the number of blocks that are free.)
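The arithmetic in this example can be verified directly:

```python
disk_bytes  = 2**31    # 2 gigabyte disk
block_bytes = 2**13    # 8K blocks
nblocks = disk_bytes // block_bytes
print(nblocks)                        # 262144 blocks (2**18)
print(nblocks * 4 // 2**20)           # 1 MB free list if every block is free
print(nblocks // 8)                   # 32768-byte bitmap (2**18 bits)
print((nblocks // 8) // block_bytes)  # 4 blocks to hold the bitmap
```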
Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile
memory. There are several techniques that can be used to mitigate the effects of these
failures. We only have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of
additional bits whose value is some function of the “user data” in the block. When the
block is read back in, the checksum is also read and compared with the data. If either the
data or checksum were corrupted, it is extremely unlikely that the checksum comparison
will succeed. Thus the disk drive itself has a way of discovering bad blocks with
extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do
automatic bad-block forwarding. The disk drive or controller is responsible for mapping
block numbers to absolute locations on the disk (cylinder, track, and sector). It holds a
little bit of space in reserve, not mapping any block numbers to this space. When a bad
block is discovered, the disk allocates one of these reserved blocks and maps the block
number of the bad block to the replacement block. All references to this block number
access the replacement block instead of the bad block. There are two problems with this
scheme. First, when a block goes bad, the data in it is lost. In practice, blocks tend to be
bad from the beginning, because of small defects in the surface coating of the disk
platters. There is usually a stand-alone formatting program that tests all the blocks on the
disk and sets up forwarding entries for those that fail. Thus the bad blocks never get used
in the first place. The main reason for the forwarding is that it is just too hard (expensive)
to create a disk with no defects. It is much more economical to manufacture a “pretty
good” disk and then use bad-block forwarding to work around the few bad blocks. The
other problem is that forwarding interferes with the OS’s attempts to lay out files
optimally. The OS may think it is doing a good job by assigning consecutive blocks of a
file to consecutive block numbers, but if one of those blocks is forwarded, it may be very
far away from the others. In practice, this is not much of a problem since a disk typically
has only a handful of forwarded sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list
(or marking them as allocated in the allocation bitmap).
Back-up Dumps
There are a variety of storage media that are much cheaper than (hard) disks but are also
much slower. An example is 8 millimeter video tape. A “two-hour” tape costs just a few
dollars and can hold two gigabytes of data. By contrast, a 2GB hard drive currently costs
several hundred dollars. On the other hand, while worst-case access time to a hard drive
is a few tens of milliseconds, rewinding or fast-forwarding a tape to desired location can
take several minutes. One way to use tapes is to make periodic back up dumps. Dumps
are really used for two different purposes:
• To recover lost files. Files can be lost or damaged by hardware failures, but far
more often they are lost through software bugs or human error (accidentally
deleting the wrong file). If the file is saved on tape, it can be restored.
• To recover from catastrophic failures. An entire disk drive can fail, or the whole
computer can be stolen, or the building can burn down. If the contents of the disk
have been saved to tape, the data can be restored (to a repaired or replacement
disk). All that is lost is the work that was done since the information was dumped.
Corresponding to these two ways of using dumps, there are two ways of doing dumps. A
physical dump simply copies all of the blocks of the disk, in order, to tape. It’s very fast,
both for doing the dump and for recovering a whole disk, but it makes it extremely slow
to recover any one file. The blocks of the file are likely to be scattered all over the tape,
and while seeks on disk can take tens of milliseconds, seeks on tape can take tens or
hundreds of seconds. The other approach is a logical dump, which copies each file
sequentially. A logical dump makes it easy to restore individual files. It is even easier to
restore files if the directories are dumped separately at the beginning of the tape, or if the
name(s) of each file are written to the tape along with the file.
The problem with logical dumping is that it is very slow. Dumps are usually done
much more frequently than restores. For example, you might dump your disk every
night for three years before something goes wrong and you need to do a restore. An
important trick that can be used with logical dumps is to only dump files that have
changed recently. An incremental dump saves only those files that have been
modified since a particular date and time. Fortunately, most file systems record the
time each file was last modified. If you do a backup each night, you can save only
those files that have changed since the last backup. Every once in a while (say once a
month), you can do a full backup of all files. In Unix jargon, a full backup is called an
epoch (pronounced “eepock”) dump, because it dumps everything that has changed
since “the epoch”–January 1, 1970, which is the earliest possible date in Unix.
The Computer Sciences department currently does backup dumps on about 260 GB of
disk space. Epoch dumps are done once every 14 days, with the timing on different
file systems staggered so that about 1/14 of the data is dumped each night. Daily
incremental dumps save about 6-10% of the data on each file system.
Incremental dumps go fast because they dump only a small fraction of the files, and they
don’t take up a lot of tape. However, they introduce new problems:
• If you want to restore a particular file, you need to know when it was last
modified so that you know which dump tape to look at.
• If you want to restore the whole disk (to recover from a catastrophic failure), you
have to restore from the last epoch dump, and then from every incremental dump
since then, in order. A file that is modified every day will appear on every tape.
Each restore will overwrite the file with a newer version. When you’re done,
everything will be up-to-date as of the last dump, but the whole process can be
extremely slow (and labor-intensive).
• You have to keep around all the incremental tapes since the last epoch. Tapes are
cheap, but they’re not free, and storing them can be a hassle.
The first problem can be solved by keeping a directory of what was dumped when. A
UW alumnus (the same person who invented NFS) made himself a millionaire by
marketing software to do this. The other problems can be solved by a
clever trick. Each dump is assigned a positive integer level. A level n dump is an
incremental dump that dumps all files that have changed since the most recent previous
dump with a level greater than or equal to n. An epoch dump is considered to have
infinitely high level. Levels are assigned to dumps as follows: the nth dump gets a level
one greater than the number of times 2 evenly divides n, yielding the sequence 1, 2, 1,
3, 1, 2, 1, 4, 1, 2, 1, 3, …
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps
only save files that have changed in the previous day. Level-2 dumps save files that have
changed in the last two days, level-3 dumps cover four days, level-4 dumps cover 8 days,
etc. Higher-level dumps will thus include more files (so they will take longer to do), but
they are done infrequently. The nice thing about this scheme is that you only need to save
one tape from each level, and the number of levels is the logarithm of the interval
between epoch dumps. Thus even if you did a dump each night and an epoch dump only
once a year, you would need only nine levels (hence nine tapes). That also
means that a full restore needs at worst one restore from each of nine tapes (rather than
365 tapes!). To figure out what tapes you need to restore from if your disk is destroyed
after dump number n, express n in binary, and number the bits from right to left, starting
with 1. The 1 bits tell you which dump tapes to use. Restore them in order of decreasing
level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th dump,
you only need to restore from the epoch dump and from the most recent dumps at levels 5
and 3.
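The restore rule can be written down in a few lines of Python (the function name is invented for illustration):

```python
def restore_levels(n):
    """Levels of the tapes needed after dump n, highest (restore first) to lowest."""
    levels = [i + 1 for i in range(n.bit_length()) if n >> i & 1]
    return sorted(levels, reverse=True)

print(restore_levels(20))   # [5, 3]: 20 is 10100 in binary
```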
Self Assessment Questions
1. Explain how the block size is affected on I/O operation to read the file.
2. Explain how do you keep a track of the free blocks that are not in any file?
3. Explain the techniques that can be used to mitigate the effects of the disk fail,
system crash and losing the content of volatile memory.
Consistency Checking
Some of the information in a file system is redundant. For example, the free list could be
reconstructed by checking which blocks are not in any file. Redundancy arises because
the same information is represented in different forms to make different operations faster.
If you want to know which blocks are in a given file, look at the inode. If you want to
know which blocks are not in any inode, use the free list. Unfortunately, various
hardware and software errors can cause the data to become inconsistent. File systems
often include a utility that checks for consistency and optionally attempts to repair
inconsistencies. These programs are particularly handy for cleaning up the disks after a
crash.
Unix has a utility called fsck. It has two principal tasks. First, it checks that blocks are
properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list
is supposed to be a tree of blocks, and each block is supposed to appear in exactly one of
these trees. Fsck runs through all the inodes, checking each allocated inode for
reasonable values, and walking through the tree of blocks rooted at the inode. It maintains
a bit vector to record which blocks have been encountered. If a block is encountered that
has already been seen, there is a problem: Either it occurred twice in the same file (in
which case it isn’t a tree), or it occurred in two different files. A reasonable recovery
would be to allocate a new block, copy the contents of the problem block into it, and
substitute the copy for the problem block in one of the two places where it occurs. It
would also be a good idea to log an error message so that a human being can check up
later to see what’s wrong. After all the files are scanned, any block that hasn’t been found
should be on the free list. It would be possible to scan the free list in a similar manner,
but it’s probably easier just to rebuild the free list from the set of blocks that were not
found in any file. If a bitmap instead of a free list is used, this step is even easier: Simply
overwrite the file system’s bitmap with the bitmap constructed during the scan.
The other main consistency requirement concerns the directory structure. The set of
directories is supposed to be a tree, and each inode is supposed to have a link count that
indicates how many times it appears in directories. The tree structure could be checked
by a recursive walk through the directories, but it is more efficient to combine this check
with the walk through the inodes that checks disk blocks, recording, for each
directory inode encountered, the inumber of its parent. The set of directories is a tree if
and only if every directory other than the root has a unique parent. This pass
can also rebuild the link count for each inode by maintaining in memory an array with
one slot for each inumber. Each time the inumber is found in a directory, increment the
corresponding element of the array. The resulting counts should match the link counts in
the inodes. If not, correct the counts in the inodes.
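The link-count pass can be sketched with toy data (the directory contents, inumbers, and stored counts here are invented):

```python
from collections import Counter

directories = {                  # directory name -> list of (entry name, inumber)
    "/":  [("a", 2), ("b", 3)],
    "/a": [("x", 3)],            # inode 3 is linked from two directories
}
stored_link_counts = {2: 1, 3: 1}    # inode 3's stored count is wrong

# Count actual directory references to each inumber, then compare.
actual = Counter(inum for entries in directories.values() for _, inum in entries)
for inum, count in stored_link_counts.items():
    if actual[inum] != count:
        print(f"inode {inum}: stored link count {count}, actual {actual[inum]}")
```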
This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system): the doctrine of hints and
absolutes. Whenever the same fact is recorded in two different ways, one of them should
be considered the absolute truth, and the other should be considered a hint. Hints are
handy because they allow some operations to be done much more quickly than they could
be if only the absolute information were available. But if the hint and the absolute do not
agree, the hint can be rebuilt from the absolutes. In a well-engineered system, there
should be some way to verify a hint whenever it is used. Unix is a bit lax about this. The
link count is a hint (the absolute information is a count of the number of times the
inumber appears in directories), but Unix treats it like an absolute during normal
operation. As a result, a small error can snowball into completely trashing the file system.
For another example of hints, each allocated block could have a header containing the
inumber of the file containing it and its offset in the file. There are systems that do this
(Unix isn’t one of them). The tree of blocks rooted at an inode then becomes a hint,
providing an efficient way of finding a block, but when the block is found, its header
could be checked. Any inconsistency would then be caught immediately, and the inode
structures could be rebuilt from the information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although
marked as allocated, does not appear in any directory), it would not be prudent to delete
the file. A better recovery is to add an entry to a special lost+found directory pointing to
the orphan inode, in case it contains something really valuable.
Transactions
The previous section talks about how to recover from situations that “can’t happen.” How
do these problems arise in the first place? Wouldn’t it be better to prevent these problems
rather than recover from them after the fact? Many of these problems arise, particularly
after a crash, because some operation was “half-completed.” For example, suppose the
system was in the middle of executing an unlink system call when the lights went out. An
unlink operation involves several distinct steps:
• remove an entry from a directory,
• decrement a link count, and if the count goes to zero,
• move all the blocks of the file to the free list, and
• free the inode.
If the crash occurs between the first and second steps, the link count will be wrong. If it
occurs during the third step, a block may be linked both into the file and the free list, or
neither, depending on the details of how the code is written. And so on…
To deal with this kind of problem in a general way, transactions were invented.
Transactions were first developed in the context of database management systems, and
are used heavily there, so there is a tradition of thinking of them as “database stuff” and
teaching about them only in database courses and text books. But they really are an
operating system concept. Here’s a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic. It is
called a critical section. Critical sections have a property that is sometimes called
synchronization atomicity. It is also called serializability because if two processes try to
execute their critical sections at about the same time, the net effect will be as if they
occurred in some serial order. If systems can crash (and they can!), synchronization
atomicity isn’t enough. We need another property, called failure atomicity, which means
an “all or nothing” property: Either all of the modifications of nonvolatile storage
complete or none of them do.
There are basically two ways to implement failure atomicity. They both depend on the
fact that writing a single block to disk is an atomic operation. The first approach is
called logging. An append-only file called a log is maintained on disk. Each time a
transaction does something to file-system data, it creates a log record describing the
operation and appends it to the log. The log record contains enough information to undo
the operation. For example, if the operation made a change to a disk block, the log record
might contain the block number, the length and offset of the modified part of the block,
and the original content of that region. The transaction also writes a begin record
when it starts, and a commit record when it is done. After a crash, a recovery process
scans the log looking for transactions that started (wrote a begin record) but never
finished (wrote a commit record). If such a transaction is found, its partially completed
operations are undone (in reverse order) using the undo information in the log records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the
cached copy and only written back out to disk from time to time. If the system crashes
before the changes are written to disk, the data structures on disk may be inconsistent.
Logging can also be used to avoid this problem by putting into each log record redo
information as well as undo information. For example, the log record for a modification
of a disk block should contain both the old and new value. After a crash, if the recovery
process discovers a transaction that has completed, it uses the redo information to make
sure the effects of all of its operations are reflected on disk. Full recovery is always
possible provided
• The log records are written to disk in order,
• The commit record is written to disk when the transaction completes, and
• The log record describing a modification is written to disk before any of the
changes made by that operation are written to disk.
This algorithm is called write-ahead logging.
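A toy sketch of log-based recovery, assuming an in-memory “disk” and a list of log records (the record format is invented for illustration; real logs are far more elaborate):

```python
# Records: {"op": "write", "txn": id, "block": n, "old": v, "new": v} or
# {"op": "commit", "txn": id}. "disk" maps block number to contents.
def recover(log, disk):
    committed = {r["txn"] for r in log if r["op"] == "commit"}
    for r in log:                          # redo committed writes, in log order
        if r["op"] == "write" and r["txn"] in committed:
            disk[r["block"]] = r["new"]
    for r in reversed(log):                # undo uncommitted writes, in reverse
        if r["op"] == "write" and r["txn"] not in committed:
            disk[r["block"]] = r["old"]
    return disk

log = [
    {"op": "write", "txn": 1, "block": 7, "old": "A", "new": "B"},
    {"op": "commit", "txn": 1},
    {"op": "write", "txn": 2, "block": 8, "old": "X", "new": "Y"},  # never committed
]
print(recover(log, {7: "A", 8: "Y"}))   # {7: 'B', 8: 'X'}
```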
The other way of implementing transactions is called shadow blocks. Suppose the data
structure on disk is a tree. The basic idea is never to change any block (disk block) of the
data structure in place. Whenever you want to modify a block, make a copy of it (called a
shadow of it) instead, and modify the parent to point to the shadow. Of course, to make
the parent point to the shadow you have to modify it, so instead you make a shadow of
the parent and modify it instead. In this way, you shadow not only each block you really
wanted to modify, but also all the blocks on the path from it to the root. You keep the
shadow of the root block in memory. At the end of the transaction, you make sure the
shadow blocks are all safely written to disk and then write the shadow of the root directly
onto the root block. If the system crashes before you overwrite the root block, there will
be no permanent change to the tree on disk. Overwriting the root block has the effect of
linking all the modified (shadow blocks) into the tree and removing all the old blocks.
Crash recovery is simply a matter of garbage collection. If the crash occurs before the
root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they
replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the
garbage blocks (they are blocks that aren’t in the tree).
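A sketch of the shadowing idea on an in-memory tree of “blocks” (the representation is invented for illustration; a real system writes the shadow blocks to disk before swapping the root):

```python
import copy

def shadow_update(blocks, root, path, new_value):
    """path: list of child indices leading from the root to the target block."""
    blocks = dict(blocks)                 # shadows are added alongside originals
    def clone(node_id, rest):
        shadow_id = max(blocks) + 1       # "allocate" a fresh block number
        blocks[shadow_id] = copy.deepcopy(blocks[node_id])
        if rest:                          # not at the target yet: shadow the child too
            child = blocks[node_id]["children"][rest[0]]
            blocks[shadow_id]["children"][rest[0]] = clone(child, rest[1:])
        else:
            blocks[shadow_id]["value"] = new_value
        return shadow_id
    new_root = clone(root, path)          # committing = swapping in new_root
    return blocks, new_root

disk = {0: {"value": "root", "children": [1]},
        1: {"value": "leaf", "children": []}}
disk2, new_root = shadow_update(disk, 0, [0], "leaf2")
print(disk[1]["value"])   # leaf: the original tree is untouched until the swap
```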
Database systems almost universally use logging, and shadowing is mentioned only in
passing in database texts. But the shadowing technique is used in a variant of the Unix
file system called (somewhat misleadingly) the Log-structured File System (LFS). The
entire file system is made into a tree by replacing the array of inodes with a tree of
inodes. LFS has the added advantage (beyond reliability) that all blocks are written
sequentially, so write operations are very fast. It has the disadvantage that files that are
modified here and there by random access tend to have their blocks scattered about, but
that pattern of access is comparatively rare, and there are techniques to cope with it when
it occurs. The main source of complexity in LFS is figuring out when and how to do the
“garbage collection.”
Performance
The main trick to improve file system performance (like anything else in computer
science) is caching. The system keeps a disk cache (sometimes also called a buffer pool)
of recently used disk blocks. In contrast with the page frames of virtual memory, where
there were all sorts of algorithms proposed for managing the cache, management of the
disk cache is pretty simple. On the whole, it is simply managed LRU (least recently
used). Why is it that for paging we went to great lengths trying to come up with an
algorithm that is “almost as good as LRU” while here we can simply use true LRU? The
problem with implementing LRU is that some information has to be updated on every
single reference. In the case of paging, references can be as frequent as every instruction,
so we have to make do with whatever information hardware is willing to give us. The
best we can hope for is that the paging hardware will set a bit in a page-table entry. In the
case of file system disk blocks, however, each reference is the result of a system call, and
adding a few extra instructions to a system call for cache maintenance is not
unreasonable.
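A minimal LRU block-cache sketch (the names are invented; real disk caches add write-back, pinning, and so on):

```python
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.cache = OrderedDict()           # block number -> data, LRU first
    def read(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)    # mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the true LRU block
            self.cache[block] = self.read_from_disk(block)
        return self.cache[block]

cache = BlockCache(2, lambda b: f"data{b}")
cache.read(1); cache.read(2); cache.read(1); cache.read(3)  # 3 evicts block 2
print(list(cache.cache))   # [1, 3]
```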
Summary
File systems and space management are an integral part of operating systems. This
section covers file management and space management, including file structure, file
types, and the different file access modes, and also deals with implementing file
systems. Under space management it covers block sizes and extents, the basic
approaches to keeping track of free space, and disk reliability techniques.
Terminal Questions
1. What do you mean by file? Explain the significance.
2. Explain why virtual memory and files are different kinds of objects.
3. Discuss the file structure? Explain the various access modes.
4. Discuss the various file organization methods?
5. What do you mean by a block & an Extent?
6. Discuss the concept of space management.
7. What do you mean by consistency checking? Discuss how it affects the file
system.
Unit 9 : Input-Output Architecture :
This unit covers the I/O structure , I/O control strategies, Program-controlled I/O,
Interrupt-controlled I/O Direct Memory Access and cover the I/O address space.
Introduction
In our discussion of the memory hierarchy (in Unit 4), it was implicitly assumed that
memory in the computer system would be “fast enough” to match the speed of the
processor (at least for the highest elements in the memory hierarchy) and that no special
consideration need be given about how long it would take for a word to be transferred
from memory to the processor – an address would be generated by the processor, and
after some fixed time interval, the memory system would provide the required
information. (In the case of a cache miss, the time interval would be longer, but generally
still fixed. For a page fault, the processor would be interrupted; and the page fault
handling software invoked.)
Although input-output devices are “mapped” to appear like memory devices in many
computer systems, I/O devices have characteristics quite different from memory devices,
and often pose special problems for computer systems. This is principally for two
reasons:
• I/O devices span a wide range of speeds. (e.g. terminals accepting input at a few
characters per second; disks reading data at over 10 million characters / second).
• Unlike memory operations, I/O operations and the CPU are not generally
synchronized with each other.
Objectives
At the end of this unit, you will be able to understand:
• Fundamentals and significance of I/O Operations
• I/O structure for a medium-scale processor system
• I/O Control Strategies
• Various Mechanisms for I/O Operations
I/O structure
Figure-1 shows the general I/O structure associated with many medium-scale processors.
Note that the I/O controllers and main memory are connected to the main system bus.
The cache memory (usually found on-chip with the CPU) has a direct connection to the
processor, as well as to the system bus.
Figure 1: A general I/O structure for a medium-scale processor system
Note that the I/O devices shown here are not connected directly to the system bus; they
interface with another device called an I/O controller. In simpler systems, the CPU may
also serve as the I/O controller, but in systems where throughput and performance are
important, I/O operations are generally handled outside the processor.
Until relatively recently, the I/O performance of a system was somewhat of an
afterthought for systems designers. The reduced cost of high-performance disks,
permitting the proliferation of virtual memory systems, and the dramatic reduction in the
cost of high-quality video display devices, have meant that designers must pay much
more attention to this aspect to ensure adequate performance in the overall system.
Because of the different speeds and data requirements of I/O devices, different I/O
strategies may be useful, depending on the type of I/O device which is connected to the
computer. Because the I/O devices are not synchronized with the CPU, some information
must be exchanged between the CPU and the device to ensure that the data is received
reliably. This interaction between the CPU and an I/O device is usually referred to as
“handshaking”. For a complete “handshake,” four events are important:
• The device providing the data (the talker) must indicate that valid data is now
available.
• The device accepting the data (the listener) must indicate that it has accepted the
data. This signal informs the talker that it need not maintain this data word on the
data bus any longer.
• The talker indicates that the data on the bus is no longer valid, and removes the
data from the bus. The talker may then set up new data on the data bus.
• The listener indicates that it is not now accepting any data on the data bus. The
listener may use data previously accepted during this time, while it is waiting for
more data to become valid on the bus.
Note that the talker and the listener each supply two signals. The talker supplies a signal
(say, data valid, or DAV) at step (1). It supplies another signal (say, data not valid, or
NOT DAV) at step (3). Both these signals can be coded as a single binary variable (DAV)
which takes the value 1 at step (1) and 0 at step (3). The listener supplies a signal (say,
data accepted, or DAC) at step (2). It supplies a signal (say, data not now accepted, or
NOT DAC) at step (4). It, too, can be coded as a single binary variable, DAC. Because
only two binary variables are required, the handshaking information can be communicated
over two wires, and the form of handshaking described above is called a two-wire handshake.
Other forms of handshaking are used in more complex situations; for example, where
there may be more than one controller on the bus, or where the communication is among
several devices. Figure 2 shows a timing diagram for the signals DAV and DAC which
identifies the timing of the four events described previously.
Figure 2: Timing diagram for two-wire handshake
Either the CPU or the I/O device can act as the talker or the listener. In fact, the CPU may
act as a talker at one time and a listener at another. For example, when communicating
with a terminal screen (an output device) the CPU acts as a talker, but when
communicating with a terminal keyboard (an input device) the CPU acts as a listener.
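The four-step handshake above can be sketched as a small simulation. This is an illustrative model of the protocol, not real bus hardware; the variable names follow the DAV and DAC signals described in the text.

```python
# A minimal simulation of the two-wire handshake described above.
# DAV (data valid) is driven by the talker; DAC (data accepted) by the listener.

def two_wire_handshake(words):
    """Transfer a list of data words, recording the four handshake events."""
    bus = {"DAV": 0, "DAC": 0, "data": None}
    received = []
    events = []
    for word in words:
        # (1) Talker places data on the bus and asserts DAV.
        bus["data"] = word
        bus["DAV"] = 1
        events.append("DAV=1")
        # (2) Listener sees DAV, latches the data, and asserts DAC.
        received.append(bus["data"])
        bus["DAC"] = 1
        events.append("DAC=1")
        # (3) Talker sees DAC, drops DAV and removes the data from the bus.
        bus["DAV"] = 0
        bus["data"] = None
        events.append("DAV=0")
        # (4) Listener sees DAV drop and drops DAC, ready for the next word.
        bus["DAC"] = 0
        events.append("DAC=0")
    return received, events

received, events = two_wire_handshake([0x41, 0x42])
print(received)    # the listener's copy of the data
print(events[:4])  # one full handshake cycle: DAV=1, DAC=1, DAV=0, DAC=0
```

Note that either party may play either role, just as the CPU acts as talker to a screen and listener to a keyboard.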
Self Assessment Questions
1. Explain the general I/O structure for a medium-scale processor system with a neat
diagram.
2. What do you mean by ‘handshaking’? Write the four important events in this
context.
I/O Control Strategies
Several I/O strategies are used between the computer system and I/O devices, depending
on the relative speeds of the computer system and the I/O devices. The simplest strategy
is to use the processor itself as the I/O controller, and to require that the device follow a
strict order of events under direct program control, with the processor waiting for the I/O
device at each step.
Another strategy is to allow the processor to be “interrupted” by the I/O devices, and to
have a (possibly different) “interrupt handling routine” for each device. This allows for
more flexible scheduling of I/O events, as well as more efficient use of the processor.
(Interrupt handling is an important component of the operating system.)
A third general I/O strategy is to allow the I/O device, or the controller for the device,
access to the main memory. The device would write a block of information in main
memory, without intervention from the CPU, and then inform the CPU in some way that
that block of memory had been overwritten or read. This might be done by leaving a
message in memory, or by interrupting the processor. (This is generally the I/O strategy
used by the highest speed devices – hard disks and the video controller.)
Program-controlled I/O
One common I/O strategy is program-controlled I/O, (often called polled I/O). Here all
I/O is performed under control of an “I/O handling procedure,” and input or output is
initiated by this procedure.
The I/O handling procedure will require some status information (handshaking
information) from the I/O device (e.g., whether the device is ready to receive data). This
information is usually obtained through a second input from the device; a single bit is
usually sufficient, so one input “port” can be used to collect status, or handshake,
information from several I/O devices. (A port is the name given to a connection to an I/O
device; e.g., to the memory location into which an I/O device is mapped). An I/O port is
usually implemented as a register (possibly a set of D flip flops) which also acts as a
buffer between the CPU and the actual I/O device. The word port is often used to refer to
the buffer itself.
Typically, there will be several I/O devices connected to the processor; the processor
checks the “status” input port periodically, under program control by the I/O handling
procedure. If an I/O device requires service, it will signal this need by altering its input to
the “status” port. When the I/O control program detects that this has occurred (by reading
the status port) then the appropriate operation will be performed on the I/O device which
requested the service. A typical configuration might look somewhat as shown in
Figure – 3. The outputs labeled “handshake out” would be connected to bits in the “status”
port. The input labeled “handshake in” would typically be generated by the appropriate
decode logic when the I/O port corresponding to the device was addressed.
Figure 3:
Program controlled I/O
Program-controlled I/O has a number of advantages:
• All control is directly under the control of the program, so changes can be readily
implemented.
• The order in which devices are serviced is determined by the program; this order
is not necessarily fixed but can be altered by the program, as necessary. This
means that the “priority” of a device can be varied under program control. (The
“priority” of a device determines which of a set of devices that are simultaneously
ready for servicing will actually be serviced first.)
• It is relatively easy to add or delete devices.
Perhaps the chief disadvantage of program-controlled I/O is that a great deal of time may
be spent testing the status inputs of the I/O devices, when the devices do not need
servicing. This “busy wait” or “wait loop” during which the I/O devices are polled but no
I/O operations are performed is really time wasted by the processor, if there is other work
which could be done at that time. Also, if a particular device has its data available for
only a short time, the data may be missed because the input was not tested at the
appropriate time.
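The polling of a status port described above can be sketched as follows. The layout assumed here, one ready bit per device in a single status port, is an illustration, not a specific machine's register map.

```python
# Program-controlled (polled) I/O sketch: the I/O handling procedure reads a
# status port in which each bit i indicates that device i requires service.

def poll_devices(status_port, handlers):
    """Service every device whose status bit is set; return those serviced."""
    serviced = []
    for bit, handler in enumerate(handlers):
        if status_port & (1 << bit):   # device `bit` is requesting service
            handler()                  # perform the appropriate I/O operation
            serviced.append(bit)
    return serviced

log = []
handlers = [lambda: log.append("keyboard"),
            lambda: log.append("printer"),
            lambda: log.append("disk")]

# Status port value 0b101: devices 0 (keyboard) and 2 (disk) are ready.
print(poll_devices(0b101, handlers))
```

The service order is simply the order the program checks the bits, which is why the "priority" of a device can be changed by rewriting the loop.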
Program controlled I/O is often used for simple operations which must be performed
sequentially. For example, the following may be used to control the temperature in a
room:
DO forever
    INPUT temperature
    IF (temperature < setpoint) THEN
        turn heat ON
    ELSE
        turn heat OFF
    END IF
END DO
Note here that the order of events is fixed in time, and that the program loops forever.
(It is really waiting for a change in the temperature, but it is a “busy wait.”)
Self Assessment Questions
1. Write the advantages of program-controlled I/O.
Interrupt-controlled I/O
Interrupt-controlled I/O reduces the severity of the two problems mentioned for program-
controlled I/O by allowing the I/O device itself to initiate the device service routine in the
processor. This is accomplished by having the I/O device generate an interrupt signal
which is tested directly by the hardware of the CPU. When the interrupt input to the CPU
is found to be active, the CPU itself initiates a subprogram call to somewhere in the
memory of the processor; the particular address to which the processor branches on an
interrupt depends on the interrupt facilities available in the processor.
The simplest type of interrupt facility is where the processor executes a subprogram
branch to some specific address whenever an interrupt input is detected by the CPU. The
return address (the location of the next instruction in the program that was interrupted) is
saved by the processor as part of the interrupt process.
If there are several devices which are capable of interrupting the processor, then with this
simple interrupt scheme the interrupt handling routine must examine each device to
determine which one caused the interrupt. Also, since only one interrupt can be handled
at a time, there is usually a hardware “priority encoder” which allows the device with the
highest priority to interrupt the processor, if several devices attempt to interrupt the
processor simultaneously. In Figure – 3, the “handshake out” outputs would be connected
to a priority encoder to implement this type of I/O. The other connections remain the
same. (Some systems use a “daisy chain” priority system to determine which of the
interrupting devices is serviced first. “Daisy chain” priority resolution is discussed later.)
In most modern processors, interrupt return points are saved on a “stack” in memory, in
the same way as return addresses for subprogram calls are saved. In fact, an interrupt can
often be thought of as a subprogram which is invoked by an external device. If a stack is
used to save the return address for interrupts, it is then possible to allow one interrupt to
interrupt the handling routine of another interrupt. In modern computer systems, there are
often several “priority levels” of interrupts, each of which can be disabled, or “masked.”
There is usually one type of interrupt input which cannot be disabled (a non-maskable
interrupt) which has priority over all other interrupts. This interrupt input is used for
warning the processor of potentially catastrophic events such as an imminent power
failure, to allow the processor to shut down in an orderly way and to save as much
information as possible.
Most modern computers make use of “vectored interrupts.” With vectored interrupts, it is
the responsibility of the interrupting device to provide the address in main memory of the
interrupt servicing routine for that device. This means, of course, that the I/O device itself
must have sufficient “intelligence” to provide this address when requested by the CPU,
and also to be initially “programmed” with this address information by the processor.
Although somewhat more complex than the simple interrupt system described earlier,
vectored interrupts provide such a significant advantage in interrupt handling speed and
ease of implementation (i.e., a separate routine for each device) that this method is almost
universally used on modern computer systems.
Some processors have a number of special inputs for vectored interrupts (each acting
much like the simple interrupt described earlier). Others require that the interrupting
device itself provide the interrupt address as part of the process of interrupting the
processor.
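Vectored dispatch can be sketched as a table of service routines indexed by the vector the device supplies. The table layout and vector numbers below are assumptions for illustration, not any particular processor's scheme.

```python
# Vectored interrupts: each device supplies a vector that selects its own
# service routine, so the handler need not poll devices to find the source.

vector_table = {}          # vector number -> interrupt service routine

def register_isr(vector, isr):
    """The device (or its driver) is 'programmed' with its vector at setup."""
    vector_table[vector] = isr

def interrupt(vector, saved_pc, stack):
    """CPU response: save the return point on the stack, call the vectored
    ISR, then restore the return point (return from interrupt)."""
    stack.append(saved_pc)             # return address saved, as for a call
    vector_table[vector]()             # branch to the device's own routine
    return stack.pop()                 # resume the interrupted program

events = []
register_isr(0x21, lambda: events.append("disk ISR"))
register_isr(0x09, lambda: events.append("keyboard ISR"))

stack = []
resume_at = interrupt(0x09, saved_pc=0x1234, stack=stack)
print(events, hex(resume_at))
```

Because each device gets a separate routine, no device-by-device examination is needed, which is the speed advantage the text describes.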
Direct Memory Access
In most mini- and mainframe computer systems, a great deal of input and output occurs
between the disk system and the processor. It would be very inefficient to perform these
operations directly through the processor; it is much more efficient if such devices, which
can transfer data at a very high rate, place the data directly into the memory, or take the
data directly from the processor without direct intervention from the processor. I/O
performed in this way is usually called direct memory access, or DMA. The controller for
a device employing DMA must have the capability of generating address signals for the
memory, as well as all of the memory control signals. The processor informs the DMA
controller that data is available (or is to be placed into) a block of memory locations
starting at a certain address in memory. The controller is also informed of the length of
the data block.
There are two possibilities for the timing of the data transfer from the DMA controller to
memory:
• The controller can cause the processor to halt if it attempts to access data in the
same bank of memory into which the controller is writing. This is the fastest option
for the I/O device, but may cause the processor to run more slowly because the
processor may have to wait until a full block of data is transferred.
• The controller can access memory in memory cycles which are not used by the
particular bank of memory into which the DMA controller is writing data. This
approach, called “cycle stealing,” is perhaps the most commonly used approach.
(In a processor with a cache that has a high hit rate this approach may not slow
the I/O transfer significantly).
DMA is a sensible approach for devices which have the capability of transferring blocks
of data at a very high data rate, in short bursts. It is not worthwhile for slow devices, or
for devices which do not provide the processor with large quantities of data. Because the
controller for a DMA device is quite sophisticated, the DMA devices themselves are
usually quite sophisticated (and expensive) compared to other types of I/O devices.
One problem that systems employing several DMA devices have to address is the
contention for the single system bus. There must be some method of selecting which
device controls the bus (acts as “bus master”) at any given time. There are many ways of
addressing the “bus arbitration” problem; three techniques which are often implemented
in processor systems are the following (these are also often used to determine the
priorities of other events which may occur simultaneously, like interrupts). They rely on
the use of at least two signals (bus_request and bus_grant), used in a manner similar to
the two-wire handshake:
Daisy chain arbitration Here, the requesting device or devices assert the signal
bus_request. The bus arbiter returns the bus_grant signal, which passes through each of
the devices which can have access to the bus, as shown in Figure - 4. Here, the priority of
a device depends solely on its position in the daisy chain. If two or more devices request
the bus at the same time, the highest priority device is granted the bus first, then the
bus_grant signal is passed further down the chain. Generally a third signal (bus_release)
is used to indicate to the bus arbiter that the first device has finished its use of the bus.
Holding bus_request asserted indicates that another device wants to use the bus.
Figure 4:
Daisy chain bus arbitration
Priority encoded arbitration Here, each device has a request line connected to a
centralized arbiter that determines which device will be granted access to the bus. The
order may be fixed by the order of connection (priority encoded), or it may be determined
by some algorithm preloaded into the arbiter. Figure - 5 shows this type of system. Note
that each device has a separate line to the bus arbiter. (The bus_grant signals have been
omitted for clarity.)
Figure 5:
Priority encoded bus arbitration
Distributed arbitration by self-selection Here, the devices themselves determine which of
them has the highest priority. Each device has a bus_request line or lines on which it
places a code identifying itself. Each device examines the codes for all the requesting
devices, and determines whether or not it is the highest priority requesting device.
These arbitration schemes may also be used in conjunction with each other. For example,
a set of similar devices may be daisy chained together, and this set may be an input to a
priority encoded scheme.
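The first and third schemes above can be sketched as follows. The daisy-chain grant stops at the first requester it reaches, so chain position alone fixes priority; for self-selection, the rule that the highest code wins is an assumed convention for illustration.

```python
# Daisy-chain arbitration sketch: bus_grant passes along the chain and is
# absorbed by the first device that asserted bus_request.

def daisy_chain_grant(requests):
    """requests[i] is True if device i asserts bus_request.
    Returns the index of the device granted the bus, or None."""
    for position, requesting in enumerate(requests):
        if requesting:
            return position        # grant absorbed here, not passed on
    return None                    # grant returns unused to the arbiter

# Distributed arbitration by self-selection: each requesting device places
# its identifying code on the bus and checks whether it has the winning code.
def self_select(codes):
    return max(codes) if codes else None

# Devices 1 and 3 request simultaneously; device 1 is nearer the arbiter.
print(daisy_chain_grant([False, True, False, True]))
print(self_select([3, 6, 1]))
```

A priority-encoded arbiter behaves like the daisy chain functionally, but each device has its own request line to a central arbiter rather than a position in a chain.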
Using interrupts driven device drivers to transfer data to or from hardware devices works
well when the amount of data is reasonably low. For example, a 9600 baud modem can
transfer approximately one character every millisecond (1/1000th of a second).
Figure – 6
If the interrupt latency, the amount of time that it takes between the hardware device
raising the interrupt and the device driver’s interrupt handling routine being called, is low
(say 2 milliseconds) then the overall system impact of the data transfer is very low. The
9600 baud modem data transfer would only take 0.002% of the CPU’s processing time.
For high speed devices, such as hard disk controllers or Ethernet devices, the data transfer
rate is a lot higher. A SCSI device can transfer up to 40 Mbytes of information per
second.
Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller
allows devices to transfer data to or from the system’s memory without the intervention
of the processor. A PC’s ISA DMA controller has 8 DMA channels of which 7 are
available for use by the device drivers. Each DMA channel has associated with it a 16 bit
address register and a 16 bit count register. To initiate a data transfer the device driver
sets up the DMA channel’s address and count registers together with the direction of the
data transfer, read or write. It then tells the device that it may start the DMA when it
wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is
taking place the CPU is free to do other things.
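The driver-side setup just described can be sketched as a small model. The register names and 16-bit widths follow the ISA description above, but the interface itself is an illustrative assumption, not real controller programming.

```python
# Sketch of an ISA-style DMA channel: the driver programs the address and
# count registers, the device then transfers without the CPU and interrupts
# when done.

class DMAChannel:
    def __init__(self, memory):
        self.memory = memory
        self.address = 0        # 16-bit address register
        self.count = 0          # 16-bit count register
        self.writing = False

    def setup(self, address, count, write):
        """Device driver programs the channel before starting the device."""
        self.address = address & 0xFFFF
        self.count = count & 0xFFFF
        self.writing = write

    def run(self, data=None):
        """Device-side transfer; the CPU is free while this happens."""
        if self.writing:                         # device -> memory
            for i in range(self.count):
                self.memory[self.address + i] = data[i]
            return "interrupt: write complete"
        else:                                    # memory -> device
            return bytes(self.memory[self.address:self.address + self.count])

memory = bytearray(256)
dma = DMAChannel(memory)
dma.setup(address=0x40, count=4, write=True)
print(dma.run(b"DATA"))
print(bytes(memory[0x40:0x44]))
```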
Device drivers have to be careful when using DMA. First of all, the DMA controller
knows nothing of virtual memory; it only has access to the physical memory in the
system. Therefore the memory that is being DMA’d to or from must be a contiguous
block of physical memory. This means that you cannot DMA directly into the virtual
address space of a process. You can, however, lock the process’s physical pages into
memory, preventing them from being swapped out to the swap device during a DMA
operation. Secondly, the DMA controller cannot access the whole of physical memory.
The DMA channel’s address register represents the first 16 bits of the DMA address, the
next 8 bits come from the page register. This means that DMA requests are limited to the
bottom 16 Mbytes of memory.
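The 16 Mbyte limit follows directly from the register widths: 16 address bits plus 8 page-register bits give a 24-bit physical address. The bit layout shown below (page register as the high byte) follows that description:

```python
# Why ISA DMA reaches only the bottom 16 Mbytes: the full address is formed
# from a 16-bit address register plus an 8-bit page register, i.e. 24 bits.

address_bits = 16 + 8                 # address register + page register
max_dma_address = 2 ** address_bits   # one past the highest addressable byte
print(max_dma_address)                # 16777216
print(max_dma_address // (1024 * 1024), "Mbytes")

# Forming a physical address from the two registers:
page, offset = 0x12, 0x3456
physical = (page << 16) | offset      # page register supplies the high bits
print(hex(physical))
```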
DMA channels are scarce resources: there are only 7 of them, and they cannot be shared
between device drivers. Just like interrupts, the device driver must be able to work out
which DMA channel it should use. Like interrupts, some devices have a fixed DMA
channel. The floppy device, for example, always uses DMA channel 2. Sometimes the
DMA channel for a device can be set by jumpers; a number of Ethernet devices use this
technique. The more flexible devices can be told (via their CSRs) which DMA channels
to use and, in this case, the device driver can simply pick a free DMA channel to use.
Self Assessment Questions
1. What do you mean by direct memory access?
2. Explain the two possibilities for the timing of the data transfer from the DMA
controller to memory.
The I/O address space
Some processors map I/O devices in their own, separate, address space; others use
memory addresses as addresses of I/O ports. Both approaches have advantages and
disadvantages. The advantages of a separate address space for I/O devices are, primarily,
that the I/O operations would then be performed by separate I/O instructions, and that all
the memory address space could be dedicated to memory.
Typically, however, I/O is only a small fraction of the operations performed by a
computer system; generally less than 1 percent of all instructions are I/O instructions in a
program. It may not be worthwhile to support such infrequent operations with a rich
instruction set, so I/O instructions are often rather restricted.
In processors with memory mapped I/O, any of the instructions which references memory
directly can also be used to reference I/O ports, including instructions which modify the
contents of the I/O port (e.g., arithmetic instructions.)
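The memory-mapped case can be sketched as follows. The port address used here is made up for illustration; the point is only that an ordinary memory reference, even an arithmetic one, reaches the port.

```python
# Memory-mapped I/O sketch: an I/O port occupies an ordinary memory address,
# so any instruction that references memory can also reference the port.
# The address chosen for the port is an arbitrary assumption.

PORT_ADDR = 0xF000                    # hypothetical address of a device port

memory = {}                           # sparse model of the address space
memory[PORT_ADDR] = 0x05              # the device has placed 5 in its port

# An ordinary "arithmetic instruction" can modify the port directly,
# e.g. an increment applied to the mapped location:
memory[PORT_ADDR] = memory[PORT_ADDR] + 1
print(memory[PORT_ADDR])              # no special I/O instruction was used
```

With a separate I/O address space, by contrast, only the (often restricted) I/O instructions can touch the port.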
Some problems can arise with memory mapped I/O in systems which use cache memory
or virtual memory. If a processor uses a virtual memory mapping, and the I/O ports are
allowed to be in a virtual address space, the mapping to the physical device may not be
consistent if there is a context switch. Moreover, the device would have to be capable of
performing the virtual-to-physical mapping. If physical addressing is used, mapping
across page boundaries may be problematic.
If the memory locations are cached, then the value in cache may not be consistent with
the new value loaded in memory. Generally, either there is some method for invalidating
cache that may be mapped to I/O addresses, or the I/O addresses are not cached at all. We
will look at the general problem of maintaining cache in a consistent state (the cache
coherency problem) in more detail when we discuss multi-processor systems.
Terminal Questions
1. What is the significance of I/O Operations?
2. Draw a block diagram of an I/O structure and discuss the working principle.
3. What are the various I/O control strategies? Discuss in brief.
4. Explain programmed I/O and interrupt I/O. How do they differ?
5. Discuss the concept of Direct Memory Access. What are its advantages over other
methods?
Unit 10 : Case Study on Windows Operating Systems :
This unit covers the architecture of the Windows NT operating system family, including
Windows NT and Windows 2000, the common functionality used to handle different
activities, and the services the family provides. It also discusses the different versions of
the operating system.
Introduction
Windows 2000, Windows XP and Windows Server 2003 are all part of the Windows NT
family of Microsoft operating systems. They are all preemptive, reentrant operating
systems, which have been designed to work with either uniprocessor or symmetric
multiprocessor (SMP) based Intel x86 computers. To process input/output (I/O) requests
they use packet-driven I/O, which utilises I/O request packets (IRPs) and asynchronous I/O.
Starting with Windows XP, Microsoft began building in 64-bit support into their
operating systems – before this their operating systems were based on a 32-bit model.
The architecture of the Windows NT operating system line is highly modular, and
consists of two main layers: a user mode and a kernel mode. Programs and subsystems in
user mode are limited in terms of what system resources they have access to, while the
kernel mode has unrestricted access to the system memory and external devices. The
kernels of the operating systems in this line are all known as hybrid kernels as their
microkernel is essentially the kernel, while higher-level services are implemented by the
executive, which exists in kernel mode.
Objective:
At the end of this unit, you will understand:
• Architectural details of Windows NT
• Functionality and operations of Windows NT
• Services and functionality of Windows NT Operating Systems
• Deployment related issues in Windows NT
Architecture of the Windows NT operating system line
The Windows NT operating system family’s architecture consists of two layers (user
mode and kernel mode), with many different modules within both of these layers.
User mode in the Windows NT line is made of subsystems capable of passing I/O
requests to the appropriate kernel mode software drivers by using the I/O manager. Two
subsystems make up the user mode layer of Windows 2000: the Environment subsystem
(runs applications written for many different types of operating systems), and the Integral
subsystem (operates system specific functions on behalf of the environment subsystem).
Kernel mode in Windows 2000 has full access to the hardware and system resources of
the computer. The kernel mode stops user mode services and applications from accessing
critical areas of the operating system that they should not have access to.
The Executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. The hybrid kernel sits between the
Hardware Abstraction Layer and the Executive to provide multiprocessor
synchronization, thread and interrupt scheduling and dispatching, and trap handling and
exception dispatching. The microkernel is also responsible for initializing device drivers
at bootup. Kernel mode drivers exist in three levels: highest level drivers, intermediate
drivers and low level drivers. Windows Driver Model (WDM) exists in the intermediate
layer and was mainly designed to be binary and source compatible between Windows 98
and Windows 2000. The lowest level drivers are either legacy Windows NT device
drivers that control a device directly or can be a PnP hardware bus.
User mode
The user mode is made up of subsystems which can pass I/O requests to the appropriate
kernel mode drivers via the I/O manager (which exists in kernel mode). Two subsystems
make up the user mode layer of Windows 2000: the Environment subsystem and the
Integral subsystem.
The environment subsystem was designed to run applications written for many different
types of operating systems. None of the environment subsystems can directly access
hardware, and must request access to memory resources through the Virtual Memory
Manager that runs in kernel mode. Also, applications run at a lower priority than kernel
mode processes. Currently, there are three main environment subsystems: the Win32
subsystem, an OS/2 subsystem and a POSIX subsystem.
The Win32 environment subsystem can run 32-bit Windows applications. It contains the
console as well as text window support, shutdown and hard-error handling for all other
environment subsystems. It also supports Virtual DOS Machines (VDMs), which allow
MS-DOS and 16-bit Windows 3.x (Win16) applications to be run on Windows. There is a
specific MS-DOS VDM which runs in its own address space and which emulates an Intel
80486 running MS-DOS 5. Win16 programs, however, run in a Win16 VDM. Each
program, by default, runs in the same process, thus using the same address space, and the
Win16 VDM gives each program its own thread to run on. However, Windows 2000 does
allow users to run a Win16 program in a separate Win16 VDM, which allows the
program to be preemptively multitasked as Windows 2000 will pre-empt the whole VDM
process, which only contains one running application. The OS/2 environment subsystem
supports 16-bit character-based OS/2 applications and emulates OS/2 1.x, but not 2.x or
later OS/2 applications. The POSIX environment subsystem supports applications that
are strictly written to either the POSIX.1 standard or the related ISO/IEC standards.
The integral subsystem looks after operating system specific functions on behalf of the
environment subsystem. It consists of a security subsystem, a workstation service and a
server service. The security subsystem deals with security tokens, grants or denies access
to user accounts based on resource permissions, handles logon requests and initiates
logon authentication, and determines which system resources need to be audited by
Windows 2000. It also looks after Active Directory. The workstation service is an API to
the network redirector, which provides the computer access to the network. The server
service is an API that allows the computer to provide network services.
Kernel mode
Windows 2000 kernel mode has full access to the hardware and system resources of the
computer and runs code in a protected memory area. It controls access to scheduling,
thread prioritization, memory management and the interaction with hardware. The kernel
mode stops user mode services and applications from accessing critical areas of the
operating system that they should not have access to as user mode processes ask the
kernel mode to perform such operations on its behalf.
Kernel mode consists of executive services, which are themselves made up of many
modules that do specific tasks, kernel drivers, a microkernel and a Hardware Abstraction
Layer, or HAL.
• Executive
The Executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. It contains various components,
including the I/O Manager, the Security Reference Monitor, the Object Manager, the IPC
Manager, the Virtual Memory Manager (VMM), a PnP Manager and Power Manager,
as well as a Window Manager which works in conjunction with the Windows Graphics
Device Interface (GDI). Each of these components exports a kernel-only support routine
that allows other components to communicate with one another. Grouped together, the
components can be called executive services. No executive component has access to the
internal routines of any other executive component.
Each object in Windows 2000 exists in its own namespace. (Figure: a screenshot of the
object namespace from SysInternals’ WinObj.)
The object manager is a special executive subsystem that all other executive subsystems
must pass through to gain access to Windows 2000 resources – essentially making it a
resource management infrastructure service. The object manager is used to reduce the
duplication of object resource management functionality in other executive subsystems,
which could potentially lead to bugs and make development of Windows 2000 harder. To
the object manager, each resource is an object, whether that resource is a physical
resource (such as a file system or peripheral) or a logical resource (such as a file). Each
object has a structure or object type that the object manager must know about. When
another executive subsystem requests the creation of an object, they send that request to
the object manager which creates an empty object structure which the requesting
executive subsystem then fills in. Object types define the object procedures and any data
specific to the object. In this way, the object manager allows Windows 2000 to be an
object oriented operating system, as object types can be thought of as classes that define
objects.
Each instance of an object that is created stores its name, parameters that are passed to
the object creation function, security attributes and a pointer to its object type. The object
also contains an object close procedure and a reference count to tell the object manager
how many other objects in the system reference that object and thereby determines
whether the object can be destroyed when a close request is sent to it. Every object exists
in a hierarchical object namespace.
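The reference-counting behaviour just described can be sketched as follows. This is a deliberately simplified model, not the real Windows 2000 object layout; the field names are assumptions for illustration.

```python
# Simplified sketch of object-manager reference counting: an object is
# destroyed only when a close request arrives and no references remain.

class ManagedObject:
    def __init__(self, name, object_type):
        self.name = name                  # name in the object namespace
        self.object_type = object_type    # defines procedures for the object
        self.ref_count = 0                # how many holders reference it
        self.destroyed = False

    def reference(self):
        self.ref_count += 1

    def close(self):
        """Close procedure: drop one reference; destroy when none remain."""
        self.ref_count -= 1
        if self.ref_count == 0:
            self.destroyed = True
        return self.destroyed

obj = ManagedObject(r"\Device\Example", "Device")   # hypothetical name
obj.reference()          # two subsystems hold the object
obj.reference()
print(obj.close())       # one holder closes: object survives  -> False
print(obj.close())       # last holder closes: object destroyed -> True
```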
Further executive subsystems are the following:
(i) I/O Manager: allows devices to communicate with user-mode subsystems. It
translates user-mode read and write commands into read or write IRPs, which it passes to
device drivers. It accepts file system I/O requests and translates them into device-specific
calls, and can incorporate low-level device drivers that directly manipulate hardware to
either read input or write output. It also includes a cache manager to improve disk
performance by caching read requests and writing to the disk in the background.
(ii) Security Reference Monitor (SRM): the primary authority for enforcing the security
rules of the security integral subsystem. It determines whether an object or resource can
be accessed, via the use of access control lists (ACLs), which are themselves made up of
access control entries (ACEs). ACEs contain a security identifier (SID) and a list of
operations that the ACE gives a select group of trustees – a user account, group account,
or logon session – permission (allow, deny, or audit) to that resource.
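An access check over ACEs can be sketched as a first-match walk over the list. The rule used here (the first ACE whose SID and operation match decides, and no match means deny) is the usual ACL convention, shown as an assumption rather than the SRM's exact algorithm.

```python
# Sketch of an SRM-style access check: walk the ACL's ACEs in order; the
# first ACE matching the requester's SID and operation decides the outcome,
# and no matching ACE means access is denied.

def access_check(acl, sid, operation):
    """acl: list of ACEs, each (sid, kind, operations); kind is 'allow'/'deny'."""
    for ace_sid, kind, operations in acl:
        if ace_sid == sid and operation in operations:
            return kind == "allow"
    return False                      # no matching ACE: access denied

acl = [
    ("S-1-5-99", "deny",  {"write"}),          # hypothetical SIDs
    ("S-1-5-99", "allow", {"read", "write"}),
    ("S-1-5-32", "allow", {"read"}),
]

print(access_check(acl, "S-1-5-99", "write"))  # deny ACE matches first
print(access_check(acl, "S-1-5-99", "read"))   # allow ACE matches
print(access_check(acl, "S-1-5-7",  "read"))   # no ACE: denied
```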
(iii) IPC Manager: short for Interprocess Communication Manager, this manages the
communication between clients (the environment subsystem) and servers (components of
the Executive). It can use two facilities: the Local Procedure Call (LPC) facility (clients
and servers on the one computer) and the Remote Procedure Call (RPC) facility (where
clients and servers are situated on different computers). Microsoft has had significant
security issues with the RPC facility.
(iv) Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to
use the hard disk as a primary storage device (although strictly speaking it is secondary
storage). It controls the paging of memory in and out of physical memory to disk storage.
(v) Process Manager: handles process and thread creation and termination
(vi) PnP Manager: handles Plug and Play and supports device detection and installation
at boot time. It also has the responsibility to stop and start devices on demand –
sometimes this happens when a bus gains a new device and needs to have a device driver
loaded to support that device. Both FireWire and USB are hot-swappable and require the
services of the PnP Manager to load, stop and start devices. The PnP manager interfaces
with the HAL, the rest of the executive (as necessary) and with device drivers.
(vii) Power Manager: the power manager deals with power events and generates power
IRPs. When several devices send requests to be turned off, it coordinates these events and
determines the best way of doing so.
The display system has been moved from user mode into the kernel mode as a device
driver contained in the file Win32k.sys. There are two components in this device driver –
the Window Manager and the GDI:
(viii) Window Manager: responsible for drawing windows and menus. It controls the
way that output is painted to the screen, handles input events (such as from the keyboard
and mouse), and passes messages to the applications that need to receive this input.
(ix) GDI: the Graphics Device Interface is responsible for tasks such as drawing lines
and curves, rendering fonts and handling palettes. Windows 2000 introduced native alpha
blending into the GDI.
(x) Microkernel & kernel-mode drivers
The Microkernel sits between the HAL and the Executive and provides multiprocessor
synchronization, thread and interrupt scheduling and dispatching, and trap handling and
exception dispatching. The Microkernel often interfaces with the process manager. The
microkernel is also responsible for initializing device drivers at bootup that are necessary
to get the operating system up and running.
Windows 2000 uses kernel-mode device drivers to enable it to interact with hardware
devices. Each of the drivers has well defined system routines and internal routines that it
exports to the rest of the operating system. All devices are seen by user mode code as a
file object in the I/O manager, though to the I/O manager itself the devices are seen as
device objects, which it defines as either file, device or driver objects. Kernel mode
drivers exist in three levels: highest level drivers, intermediate drivers and low level
drivers. The highest level drivers, such as file system drivers for FAT and NTFS, rely on
intermediate drivers. Intermediate drivers consist of function drivers – or main driver for
a device – that are optionally sandwiched between lower and higher level filter drivers.
The function driver then relies on a bus driver – or a driver that services a bus controller,
adapter, or bridge – which can have an optional bus filter driver that sits between itself
and the function driver. Intermediate drivers rely on the lowest level drivers to function.
The Windows Driver Model (WDM) exists in the intermediate layer. The lowest level
drivers are either legacy Windows NT device drivers that control a device directly or PnP
hardware bus drivers. These lower level drivers directly control hardware and do not
rely on any other drivers.
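The layering described above, in which a request travels from the highest level driver down to the hardware, might be sketched like this; the class names, the trace field, and the driver names are invented for illustration:

```python
# Hypothetical sketch of an IRP travelling down a layered driver stack.
class Irp:
    def __init__(self, op):
        self.op = op
        self.trace = []           # records which drivers handled the IRP

class Driver:
    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower        # next driver down the stack, if any
    def dispatch(self, irp):
        irp.trace.append(self.name)
        if self.lower:                    # pass the request down the stack
            return self.lower.dispatch(irp)
        return f"hardware <- {irp.op}"    # lowest level driver touches hardware

# highest level (file system) -> intermediate (function) -> lowest (bus)
bus = Driver("bus")
function = Driver("function", lower=bus)
ntfs = Driver("ntfs", lower=function)

irp = Irp("write")
result = ntfs.dispatch(irp)
```

The higher levels never touch hardware themselves; each one records its work and forwards the IRP, mirroring how the highest level drivers rely on intermediate drivers, which in turn rely on the lowest level drivers.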
(xi) Hardware abstraction layer
The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical
hardware of the computer and the rest of the operating system. It was designed to hide
differences in hardware and therefore provide a consistent platform on which applications
may run. The HAL includes hardware specific code that controls I/O interfaces, interrupt
controllers and multiple processors.
Windows 2000 was designed to support the 64-bit DEC Alpha. After Compaq announced
they would discontinue support of the processor, Microsoft stopped releasing test builds
of Windows 2000 for AXP to the public, stopping with beta 3. Development of Windows
on the Alpha continued internally in order to continue to have a 64-bit architecture
development model ready until the wider availability of the Intel Itanium IA-64
architecture. The HAL now only supports hardware that is compatible with the Intel x86
architecture.
Microsoft has had numerous security issues caused by vulnerabilities in its RPC
mechanisms. The following security bulletins have been issued by Microsoft regarding
RPC vulnerabilities:
Microsoft Security Bulletin MS03-026 (“Buffer Overrun in RPC May Allow Code
Execution”): describes a vulnerability in the part of RPC that deals with message
exchange over TCP/IP, resulting from incorrect handling of malformed messages. This
particular vulnerability affects a Distributed Component Object Model (DCOM) interface
with RPC, which listens on RPC-enabled ports.
Microsoft Security Bulletin MS03-001: A security vulnerability results from an
unchecked buffer in the Locator service. By sending a specially malformed request to the
Locator service, an attacker could cause the Locator service to fail, or to run code of the
attacker’s choice on the system.
Microsoft Security Bulletin MS03-010: This particular vulnerability affects the RPC
Endpoint Mapper process, which listens on TCP/IP port 135. The RPC endpoint mapper
allows RPC clients to determine the port number currently assigned to a particular RPC
service. To exploit this vulnerability, an attacker would need to establish a TCP/IP
connection to the Endpoint Mapper process on a remote machine. Once the connection
was established, the attacker would begin the RPC connection negotiation before
transmitting a malformed message. At this point, the process on the remote machine
would fail. The RPC Endpoint Mapper process is responsible for maintaining the
connection information for all of the processes on that machine using RPC. Because the
Endpoint Mapper runs within the RPC service itself, exploiting this vulnerability would
cause the RPC service to fail, with the attendant loss of any RPC-based services the
server offers, as well as potential loss of some COM functions.
Microsoft Security Bulletin MS04-029: This RPC Runtime library vulnerability, tracked
as CAN-2004-0569, is titled “Vulnerability in RPC Runtime Library Could Allow
Information Disclosure and Denial of Service”.
Microsoft Security Bulletin MS00-066: describes a remote denial of service vulnerability
in RPC. Blocking ports 135-139 and 445 can stop such attacks.
Microsoft Security Bulletin MS03-039: “There are three newly identified vulnerabilities
in the part of RPCSS Service that deals with RPC messages for DCOM activation- two
that could allow arbitrary code execution and one that could result in a denial of service.
The flaws result from incorrect handling of malformed messages. These particular
vulnerabilities affect the Distributed Component Object Model (DCOM) interface within
the RPCSS Service. This interface handles DCOM object activation requests that are sent
from one machine to another. An attacker who successfully exploited these
vulnerabilities could be able to run code with Local System privileges on an affected
system, or could cause the RPCSS Service to fail. The attacker could then be able to take
any action on the system, including installing programs, viewing, changing or deleting
data, or creating new accounts with full privileges. To exploit these vulnerabilities, an
attacker could create a program to send a malformed RPC message to a vulnerable
system targeting the RPCSS Service.”
Microsoft Security Bulletin MS01-041: “Several of the RPC servers associated with
system services in Microsoft Exchange Server, SQL Server, Windows NT 4.0 and
Windows 2000 do not adequately validate inputs, and in some cases will accept invalid
inputs that prevent normal processing. The specific input values at issue here vary from
RPC server to RPC server. An attacker who sent such inputs to an affected RPC server
could disrupt its service. The precise type of disruption would depend on the specific
service, but could range in effect from minor (e.g., the service temporarily hanging) to
major (e.g., the service failing in a way that would require the entire system to be
restarted).”
Windows 2000
Windows 2000 (also referred to as Win2K or W2K) is a preemptible and interruptible,
graphical, business-oriented operating system that was designed to work with either
uniprocessor or symmetric multi-processor (SMP) 32-bit Intel x86 computers. It is part of
the Microsoft Windows NT line of operating systems and was released on February 17,
2000. Windows 2000 comes in four versions: Professional, Server, Advanced Server, and
Datacenter Server. Additionally, Microsoft offers Windows 2000 Advanced Server-
Limited Edition, which was released in 2001 and runs on 64-bit Intel Itanium
microprocessors. Windows 2000 is classified as a hybrid-kernel operating system, and its
architecture is divided into two modes: user mode and kernel mode. The kernel mode
provides unrestricted access to system resources and facilitates the user mode, which is
heavily restricted and designed for most applications.
All versions of Windows 2000 have common functionality, including many system
utilities such as the Microsoft Management Console (MMC) and standard system
management applications such as a disk defragmentation utility. Support for people with
disabilities has also been improved by Microsoft across their Windows 2000 line, and
they have included increased support for different languages and locale information. All
versions of the operating system support the Windows NT filesystem, NTFS 5, the
Encrypted File System (EFS), as well as basic and dynamic disk storage. Dynamic disk
storage allows different types of volumes to be used. The Windows 2000 Server family
has enhanced functionality, including the ability to provide Active Directory services (a
hierarchical framework of resources), Distributed File System (a file system that supports
sharing of files) and fault-redundant storage volumes.
Windows 2000 can be installed and deployed to an enterprise through either an attended
or unattended installation. Unattended installations rely on the use of answer files to fill
in installation information, and can be performed from a bootable CD, through Microsoft
Systems Management Server (SMS), or with the System Preparation Tool (Sysprep).
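As an illustration, a minimal answer file fragment might look like the following. The section and key names reflect the common unattend.txt layout, but the values are placeholders and this is not a complete, validated file:

```ini
; Illustrative fragment of a Windows 2000 answer file (unattend.txt);
; values are placeholders, not a working configuration.
[Unattended]
UnattendMode = FullUnattended
TargetPath = \WINNT

[UserData]
FullName = "Example User"
OrgName = "Example Org"
ComputerName = EXAMPLE-PC
```

Setup reads each section during the corresponding installation phase, so the answer file substitutes for the questions an attended installation would ask interactively.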
History
Windows 2000 originally descended from the Microsoft Windows NT operating system
product line. Originally called Windows NT 5.0, it was renamed Windows 2000 on
October 27, 1998. It was also the first Windows version that was released
without a code name, though Windows 2000 Service Pack 1 was codenamed “Asteroid”
and Windows 2000 64-bit was codenamed “Janus” (not to be confused with Windows
3.1, which had the same codename). The first beta for Windows 2000 was released on
September 27, 1997 and several further betas were released until Beta 3 which was
released on April 29, 1999. From here, Microsoft issued three release candidates between
July and November 1999, and finally released the operating system to partners on
December 12, 1999. The public received the full version of Windows 2000 on February
17, 2000 and the press immediately hailed it as the most stable operating system
Microsoft had ever released. Novell, however, was not so impressed with Microsoft’s
new directory service architecture as they found it to be less scalable or reliable than their
own Novell Directory Services (NDS) technology. On September 29, 2000, Microsoft
released Windows 2000 Datacenter. Microsoft released Service Pack 1 (SP1) on August
15, 2000, Service Pack 2 (SP2) on May 16, 2001, Service Pack 3 (SP3) on August 29,
2002 and its last Service Pack (SP4) on June 26, 2003. Microsoft has stated that they will
not release a Service Pack 5, but instead, have offered an “Update Rollup” for Service
Pack 4. Microsoft phased out all development of their Java Virtual Machine (JVM) from
Windows 2000 in Service Pack 3.
Windows 2000 has since been superseded by newer Microsoft operating systems.
Microsoft has replaced Windows 2000 Server products with Windows Server 2003, and
Windows 2000 Professional with Windows XP Professional. Windows Neptune started
development in 1999, and was supposed to be the home-user edition of Windows 2000.
However, the project lagged in production time – and only one alpha release was built.
Windows Me was released as a substitute, and the Neptune project was forwarded to the
production of Whistler (Windows XP). The only elements of the Neptune project which
were included in Windows 2000 were the ability to upgrade from Windows 95 or
Windows 98, and support for the FAT32 file system.
Several notable security flaws have been found in Windows 2000. Code Red and Code
Red II were famous (and highly visible to the worldwide press) computer worms that
exploited vulnerabilities of the indexing service of Windows 2000’s Internet Information
Services (IIS). In August 2003, two major worms, Sobig and Blaster, began to attack
millions of Microsoft Windows computers, resulting in some of the largest downtime and
clean-up costs to that date.
Architecture
Windows 2000 is a highly modular system that consists of two main layers: a user mode
and a kernel mode. The user mode refers to the mode in which user programs are run.
Such programs are limited in terms of what system resources they have access to, while
the kernel mode has unrestricted access to the system memory and external devices. All
user mode applications access system resources through the executive which runs in
kernel mode.
User mode
User mode in Windows 2000 is made of subsystems capable of passing I/O requests to
the appropriate kernel mode drivers by using the I/O manager. Two subsystems make up
the user mode layer of Windows 2000: the environment subsystem and the integral
subsystem.
The environment subsystem was designed to run applications written for many different
types of operating systems. These applications, however, run at a lower priority than
kernel mode processes. There are three main environment subsystems:
Win32 subsystem runs 32-bit Windows applications and also supports Virtual DOS
Machines (VDMs), which allows MS-DOS and 16-bit Windows 3.x (Win16) applications
to run on Windows.
OS/2 environment subsystem supports 16-bit character-based OS/2 applications and
emulates OS/2 1.3 and 1.x, but not 2.x or later OS/2 applications.
POSIX environment subsystem supports applications that are strictly written to either the
POSIX.1 standard or the related ISO/IEC standards.
The integral subsystem looks after operating system specific functions on behalf of the
environment subsystem. It consists of a security subsystem (grants/denies access and
handles logons), workstation service (helps the computer gain network access) and a
server service (lets the computer provide network services).
Kernel mode
Kernel mode in Windows 2000 has full access to the hardware and system resources of
the computer. The kernel mode stops user mode services and applications from accessing
critical areas of the operating system that they should not have access to.
The executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. It contains various components,
including:
Object manager: a special executive subsystem that all other executive subsystems must
pass through to gain access to Windows 2000 resources. This essentially is a resource
management infrastructure service that allows Windows 2000 to be an object oriented
operating system.
I/O Manager: allows devices to communicate with user-mode subsystems by translating
user-mode read and write commands and passing them to device drivers.
Security Reference Monitor (SRM): the primary authority for enforcing the security
rules of the security integral subsystem.
IPC Manager: short for Interprocess Communication Manager, manages the
communication between clients (the environment subsystem) and servers (components of
the executive).
Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to use
the hard disk as a primary storage device (although strictly speaking it is secondary
storage).
Process Manager: handles process and thread creation and termination.
PnP Manager: handles Plug and Play and supports device detection and installation at
boot time.
Power Manager: the power manager coordinates power events and generates power
IRPs.
The display system is handled by a device driver contained in Win32k.sys. The Window
Manager component of this driver is responsible for drawing windows and menus while
the GDI (graphical device interface) component is responsible for tasks such as drawing
lines and curves, rendering fonts and handling palettes.
The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical
hardware of the computer and the rest of the operating system. It was designed to hide
differences in hardware and therefore provide a consistent platform to run applications
on. The HAL includes hardware specific code that controls I/O interfaces, interrupt
controllers and multiple processors.
The microkernel sits between the HAL and the executive and provides multiprocessor
synchronization, thread and interrupt scheduling and dispatching, trap handling and
exception dispatching. The microkernel often interfaces with the process manager. The
microkernel is also responsible for initializing device drivers at bootup that are necessary
to get the operating system up and running.
Common functionality
Certain features are common across all versions of Windows 2000 (both Professional and
the Server versions), among them being NTFS 5, the Microsoft Management Console
(MMC), the Encrypting File System (EFS), dynamic and basic disk storage, usability
enhancements and multi-language and locale support. Windows 2000 also includes
several standard system utilities. In addition to these features, Microsoft introduced a new
feature to protect critical system files, called Windows File Protection (WFP). This
prevents programs (with the exception of Microsoft’s update programs) from replacing
critical Windows system files and thus rendering the system inoperable.
Microsoft recognised that the infamous Blue Screen of Death (or stop error) could cause
serious problems for servers that needed to be constantly running and so provided a
system setting that would allow the server to automatically reboot when a stop error
occurred. Users have the option of dumping the first 64KB of memory to disk (the
smallest amount of memory that is useful for debugging purposes, also known as a
minidump), a dump of only the kernel’s memory, or a dump of the entire contents of
memory to disk, as well as having the event recorded in the Windows 2000 event log.
In order to improve performance on computers running Windows 2000 as a server
operating system, Microsoft gave administrators the choice of optimising the operating
system for background services or for applications.
NTFS 5
Windows 2000 supports disk quotas, which can be set via the “Quotas” tab found in the
hard disk properties dialog box.
Microsoft released the third version of the NT File System (NTFS) – also known as
version 5.0 – in Windows 2000; this introduced quotas, file-system-level encryption
(called EFS), sparse files and reparse points. Sparse files allow for the efficient storage of
data sets that are very large yet contain many areas that only have zeroes. Reparse points
allow the object manager to reset a file namespace lookup and let file system drivers
implement changed functionality in a transparent manner. Reparse points are used to
implement Volume Mount Points, Directory Junctions, Hierarchical Storage
Management, Native Structured Storage and Single Instance Storage. Volume mount
points and directory junctions allow a file to be transparently redirected from one file or
directory location to another.
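The idea behind sparse files, storing a large run of zeroes without consuming disk blocks for it, can be demonstrated by seeking past a hole before writing. Whether the hole actually saves physical space depends on the file system (NTFS requires the sparse attribute to be set on the file); the logical size, however, always includes the hole:

```python
# Write a file whose first 100 MB is a hole: seek past it, then write only
# the non-zero tail. On a sparse-capable file system the hole consumes no
# disk blocks, yet the file's logical size still covers the full run.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")
with open(path, "wb") as f:
    f.seek(100 * 1024 * 1024)   # skip a 100 MB run of zeroes
    f.write(b"tail")            # only these 4 bytes carry real data

logical_size = os.path.getsize(path)   # full 100 MB + 4 bytes
```

Reading anywhere inside the hole returns zero bytes, which is exactly the behaviour NTFS sparse files present to applications.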
Encrypting File System
The Encrypting File System (EFS) introduced strong encryption into the Windows file
world. It allowed any folder or drive on an NTFS volume to be encrypted transparently to
the end user. EFS works in conjunction with the EFS service, Microsoft’s CryptoAPI and
the EFS File System Run-Time Library (FSRTL). As of February 2004, its encryption
has not been compromised.
EFS works by encrypting a file with a bulk symmetric key (also known as the File
Encryption Key, or FEK), used because encrypting and decrypting large amounts of data
takes far less time with a symmetric cipher than with an asymmetric one. The symmetric
key that is used to encrypt the file is then encrypted with a public
key that is associated with the user who encrypted the file, and this encrypted data is
stored in the header of the encrypted file. To decrypt the file, the file system uses the
private key of the user to decrypt the symmetric key that is stored in the file header. It
then uses the symmetric key to decrypt the file. Because this is done at the file system
level, it is transparent to the user. Also, in case of a user losing access to their key,
support for recovery agents that can decrypt files has been built in to the EFS system.
Basic and dynamic disk storage
Windows 2000 introduced the Logical Disk Manager for dynamic storage. All versions
of Windows 2000 support three types of dynamic disk volumes (along with basic
storage): simple volumes, spanned volumes and striped volumes:
Simple volume: this is a volume with disk space from one disk.
Spanned volumes: a volume whose disk space spans up to 32 disks. If one disk fails, all
data in the volume is lost.
Striped volumes: also known as RAID-0, a striped volume stores all its data across
several disks in stripes. This allows better performance because disk read and writes are
balanced across multiple disks. Windows 2000 also added support for iSCSI protocol.
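The round-robin striping that balances reads and writes across the member disks can be sketched as follows; the stripe size and data are illustrative:

```python
# Minimal sketch of RAID-0 striping: split the data into fixed-size stripes
# and assign them round-robin across the member disks.
def stripe(data: bytes, disks: int, stripe_size: int):
    """Return one list of stripes per disk, assigned round-robin."""
    layout = [[] for _ in range(disks)]
    for i in range(0, len(data), stripe_size):
        layout[(i // stripe_size) % disks].append(data[i:i + stripe_size])
    return layout

layout = stripe(b"ABCDEFGH", disks=2, stripe_size=2)
# disk 0 holds stripes "AB" and "EF"; disk 1 holds "CD" and "GH"
```

Because consecutive stripes land on different disks, a sequential transfer keeps every disk busy at once, which is the source of the performance gain the text describes.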
Accessibility support
The Windows 2000 onscreen keyboard map allows users who have problems with using
the keyboard to use a mouse to input text.
Microsoft made an effort to increase the usability of Windows 2000 for people with
visual and auditory impairments and other disabilities. They included several utilities
designed to make the system more accessible:
FilterKeys: a group of keyboard-related accessibility features for people with typing
issues, which include:
SlowKeys: Windows is told to disregard keystrokes that are not held down for a certain
time period
BounceKeys: causes repeated keystrokes of the same key within a certain timeframe to be
ignored
RepeatKeys: allows users to slow down the rate at which keys are repeated via the
keyboard’s keyrepeat feature
ToggleKeys: when turned on, Windows will play a sound when either the CAPS LOCK,
NUM LOCK or SCROLL LOCK keys are pressed
MouseKeys: allows the cursor to be moved around the screen via the numeric keypad
instead of the mouse
On screen keyboard: assists those who are not familiar with a given keyboard by
allowing them to use a mouse to enter characters on the screen
SerialKeys: gives Windows 2000 the ability to support speech augmentation devices
StickyKeys: makes modifier keys (ALT, CTRL and SHIFT) become “sticky” – in other
words a user can press the modifier key, release that key and then press the combination
key. Normally the modifier key must remain pressed down to activate the sequence.
On screen magnifier: assists users with visual impairments by magnifying the part of
the screen they place their mouse over.
Narrator: Microsoft Narrator assists users with visual impairments by reading system
messages aloud via the sound system as they appear
High contrast theme: to assist users with visual impairments
SoundSentry: designed to help users with auditory impairments, Windows 2000 will
show a visual effect when a sound is played through the sound system
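The SlowKeys and BounceKeys timing rules above can be sketched as a simple event filter; the thresholds and the (key, press, release) millisecond event format are invented for illustration:

```python
# Sketch of the SlowKeys and BounceKeys rules described in the text.
def filter_keys(events, slow_ms=200, bounce_ms=150):
    """Keep a keystroke only if it was held for at least slow_ms (SlowKeys)
    and is not a repeat of the same key within bounce_ms of the last
    accepted press of that key (BounceKeys)."""
    accepted = []
    last_release = {}                 # key -> release time of last accepted press
    for key, press, release in events:
        if release - press < slow_ms:
            continue                  # SlowKeys: not held down long enough
        if key in last_release and press - last_release[key] < bounce_ms:
            continue                  # BounceKeys: accidental repeated keystroke
        accepted.append(key)
        last_release[key] = release
    return accepted

events = [("a", 0, 250), ("a", 300, 600), ("b", 700, 750), ("c", 800, 1100)]
```

Here the second "a" is discarded as a bounce and "b" is discarded for being held too briefly, so only "a" and "c" are accepted.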
Language & locale support
Windows 2000 has support for many languages other than English. It supports Arabic,
Armenian, Baltic, Central European, Cyrillic, Georgian, Greek, Hebrew, Indic, Japanese,
Korean, Simplified Chinese, Thai, Traditional Chinese, Turkic, Vietnamese and Western
European languages. It also has support for many different locales, a list of which can be
found on Microsoft’s website.
System utilities
The Microsoft Management Console (MMC) is used for administering Windows 2000
computers.
Windows 2000 introduced the Microsoft Management Console (MMC), which is used to
create, save, and open administrative tools. Each of the tools is called a console, and most
consoles allow an administrator to administer other Windows 2000 computers from one
centralised computer. Each console can contain one or many specific administrative
tools, called snap-ins. Snap-ins can be either standalone (performs one function), or
extensions (adds functionality to an existing snap-in). In order to provide the ability to
control what snap-ins can be seen in a console, the MMC allows consoles to be created in
author mode or in user mode. Author mode allows snap-ins to be added, new windows to
be created, all portions of the console tree to be displayed, and consoles to be saved. User
mode allows consoles to be distributed with restrictions applied. A user mode console can
grant users full access, so they can make whatever changes they desire; limited access, so
that users cannot add to the console but can still view multiple windows within it; or
limited access in which users can neither add to the console nor view multiple windows.
The Windows 2000 Computer Management console is capable of performing many
system tasks, including disk defragmentation.
The main tools that come with Windows 2000 can be found in the Computer
Management console (found in Administrative Tools in the Control Panel). This contains
the event viewer – a means of seeing events and the Windows equivalent of a log file, a
system information viewer, the ability to view open shared folders and shared folder
sessions, a device manager and a tool to view all the local users and groups on the
Windows 2000 computer. It also contains a disk management snap-in, which contains a
disk defragmenter as well as other disk management utilities. Lastly, it also contains a
services viewer, which allows users to view all installed services and to stop and start
them on demand, as well as configure what those services should do when the computer
starts.
REGEDIT.EXE utility:
Windows 2000 comes bundled with two utilities to edit the Windows registry. One acts
like the Windows 9x REGEDIT.EXE program and the other can edit registry permissions
in the same manner as Windows NT’s REGEDT32.EXE program.
REGEDIT.EXE has a left-side tree view that begins at “My Computer” and lists all
loaded hives. REGEDT32.EXE has a left-side tree view, but each hive has its own
window, so the tree displays only keys. REGEDIT.EXE represents the three components
of a value (its name, type, and data) as separate columns of a table. REGEDT32.EXE
represents them as a list of strings. REGEDIT.EXE was written for the Win32 API and
supports right-clicking of entries in a tree view to adjust properties and other settings.
REGEDT32.EXE was also written for the Win32 API and requires all actions to be
performed from the top menu bar. Because REGEDIT.EXE was directly ported from
Windows 98, it does not support permission editing (permissions do not exist in
Windows 9x). Therefore, the only way to access the full functionality of an NT registry
was with REGEDT32.EXE, which uses the older multiple document interface (MDI),
which newer versions of regedit do not use. Windows XP was the first system to integrate
these two programs into one, adopting the REGEDIT.EXE behavior with the additional
NT functionality.
The System File Checker (SFC) also comes bundled with Windows 2000. It is a
command line utility that scans system files and verifies whether they were signed by
Microsoft and works in conjunction with the Windows File Protection mechanism. It can
also repopulate and repair all the files in the Dllcache folder.
Recovery Console
The Recovery Console is an application that is run from outside the installed copy of
Windows and that enables a user to perform maintenance tasks that cannot be run from
inside the installed copy, or cannot feasibly be run from another computer or copy of
Windows 2000. It is most often used to recover the system from errors that cause booting
to fail, which would render other tools useless.
It presents itself as a simple command line interface. The commands are limited to ones
for checking and repairing the hard drive(s), repairing boot information (including
NTLDR), replacing corrupted system files with fresh copies from the CD, or
enabling/disabling services and drivers for the next boot.
The console can be accessed in one of two ways:
Starting from the Windows 2000 CD and choosing to enter the Recovery Console, or
Installing the Recovery Console via Winnt32.exe with the /cmdcons switch. In this case,
however, the console can only be used if the system boots to the point where NTLDR can
start it.
Server family functionality
The Windows 2000 server family consists of Windows 2000 Server, Windows 2000
Advanced Server and Windows 2000 Datacenter Server.
All editions of Windows 2000 Server have the following services and functionality built-
in:
Routing and Remote Access Service (RRAS) support: facilitates dial-up and VPN
connections, RADIUS authentication, network connection sharing, Network Address
Translation, and unicast and multicast routing
DNS server: includes support for Dynamic DNS; Active Directory relies heavily on DNS
Microsoft Connection Manager Administration Kit and Connection Point Services
Distributed File System (DFS) support
Hierarchical Storage Management support: a service that runs in conjunction with NTFS
and automatically transfers files that have not been used for some period of time to less
expensive storage media
Fault tolerant volumes: supports mirrored and RAID-5 volumes
Group Policy (part of Active Directory)
Distributed File System
The Distributed File System, or DFS, allows shares in multiple different locations to be
logically grouped under one folder, or DFS root. When users try to access a share under
the DFS root, they are really looking at a DFS link, and the DFS server transparently
redirects them to the correct file server and share. A DFS root can only exist
on a Windows 2000 version that is part of the server family, and only one DFS root can
exist on that server.
There can be two ways of implementing DFS on Windows 2000: through standalone
DFS, or through domain-based DFS. Standalone DFS allows for only DFS roots that
exist on the local computer, and thus does not use Active Directory. Domain-based DFS
roots exist within Active Directory and can have their information distributed to other
domain controllers within the domain – this provides fault tolerance to DFS. DFS roots
that exist on a domain must be hosted on a domain controller or on a domain member
server. The file and root information is replicated via the Microsoft File Replication
Service (FRS).
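The transparent redirection described above can be sketched as a prefix-rewriting lookup; the link table and UNC paths are hypothetical:

```python
# Conceptual sketch of DFS link resolution: a path under the DFS root is
# rewritten to the real server and share before the request proceeds.
dfs_links = {
    r"\\corp\dfs\finance": r"\\server1\finance$",
    r"\\corp\dfs\eng":     r"\\server2\engineering",
}

def resolve(path: str) -> str:
    """Rewrite the longest matching DFS link prefix to its real target."""
    for link, target in sorted(dfs_links.items(),
                               key=lambda kv: len(kv[0]), reverse=True):
        if path.startswith(link):
            return target + path[len(link):]
    return path                  # not under a DFS link: use the path as-is

resolved = resolve(r"\\corp\dfs\finance\q3.xls")
```

The user only ever sees the logical path under the DFS root; which file server actually holds the share can change without the client noticing, which is what makes domain-based DFS replication useful for fault tolerance.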
Active Directory
Active Directory allows administrators to assign enterprise-wide policies, deploy
programs to many computers and apply critical updates to an entire organization, and is
one of the main reasons why many corporations have moved to Windows 2000. Active
Directory stores information about users, computers and other resources, and can act in
a similar manner to a phone book: all of the information and computer settings of an
organization are stored in a central, organised database. Active Directory networks can
vary from a small installation with a few hundred objects to a large installation with
millions of objects. Active Directory organises groups of resources into domains;
domains that share a contiguous namespace can be linked together to form trees, and
groups of trees that do not exist within the same namespace can be linked together to
form forests.
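The naming structure can be illustrated with a short sketch: a domain's LDAP distinguished name is built from the labels of its DNS name, so child domains in a tree simply extend the parent's name (the domain names below are invented):

```python
# Illustrative only: mapping the DNS names of Active Directory domains
# to LDAP distinguished names. The domain names are made up.
def domain_to_dn(dns_name: str) -> str:
    """Build a domain's distinguished name from its DNS labels."""
    return ",".join("DC=" + label for label in dns_name.split("."))

# A contiguous namespace forms a tree: each child extends the parent name.
tree = ["example.com", "sales.example.com", "eu.sales.example.com"]
for domain in tree:
    print(domain_to_dn(domain))
# A forest can additionally contain trees with disjoint namespaces,
# e.g. "example.com" and "other.net" joined by transitive trusts.
```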
Active Directory can only be installed on a Windows 2000 Server, Advanced Server or
Datacenter Server computer, and cannot be installed on a Windows 2000 Professional
computer. It requires that a DNS service that supports SRV resource records be installed,
or that an existing DNS infrastructure be upgraded to support this functionality. It also
requires that one or more domain controllers exist to hold the Active Directory database
and provide Active Directory directory services.
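The SRV records in question advertise which hosts offer the LDAP and Kerberos services for a domain. In standard zone-file notation they look roughly like the following fragment (the domain and host names are illustrative):

```
; Format: _service._protocol.name  TTL class SRV priority weight port target
_ldap._tcp.example.com.     600 IN SRV 0 100 389 dc1.example.com.
_kerberos._tcp.example.com. 600 IN SRV 0 100 88  dc1.example.com.
```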
Volume fault tolerance
Along with support for simple, spanned and striped volumes, the server family of
Windows 2000 also supports fault tolerant volume types. The types supported are
mirrored volumes and RAID-5 volumes:
Mirrored volumes: the volume consists of two disks; when data is written to one disk, it
is mirrored to the other. This means that if one disk fails, the data can be fully
recovered from the other disk. Mirrored volumes are also known as RAID-1.
RAID-5 volumes: a RAID-5 volume consists of three or more disks and uses block-level
striping with parity data distributed across all member disks. Should a disk fail in the
array, the parity blocks from the surviving disks are combined mathematically (by
XOR) with the data blocks from the surviving disks to reconstruct the data on the failed
drive on the fly.
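The parity arithmetic is plain XOR, which a short sketch can demonstrate. Block sizes and disk counts here are toy values, and a real array rotates the parity block across its members rather than keeping it on one disk:

```python
# Simplified RAID-5 parity: XOR the data blocks of one stripe.
def parity(blocks: list[bytes]) -> bytes:
    """XOR a list of equal-sized blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripe = [b"\x0f\x0f", b"\xf0\x00", b"\x33\x3c"]  # data on three disks
p = parity(stripe)                                # parity on a fourth disk

# If disk 1 fails, XOR-ing the survivors with the parity rebuilds its data,
# because x ^ x = 0 cancels every block except the missing one.
rebuilt = parity([stripe[0], stripe[2], p])
assert rebuilt == stripe[1]
```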
Versions
Windows 2000 Professional was designed as the desktop operating system for
businesses and power users. It is the most basic, and the most common, edition of
Windows 2000. It offers greater security and stability than many of the previous
Windows desktop operating systems. It supports up to two processors and can address
up to 4 GB of RAM.
Windows 2000 Server products share the same user interface with Windows 2000
Professional, but contain additional components for running infrastructure and
application software. A significant component of the server products is Active Directory,
which is an enterprise-wide directory service based on LDAP. Additionally, Microsoft
integrated Kerberos network authentication, replacing the often-criticised NTLM
authentication system used in previous versions. This also provided a purely transitive-
trust relationship between Windows 2000 domains in a forest (a collection of one or more
Windows 2000 domains that share a common schema, configuration, and global
catalogue, being linked with two-way transitive trusts). Furthermore, Windows 2000
introduced a DNS server which allows dynamic registration of IP addresses.
Windows 2000 Advanced Server is a variant of the Windows 2000 Server operating
system designed for medium-to-large businesses. It offers clustering infrastructure for
high availability and scalability of applications and services, including main memory
support of up to 8 gigabytes (GB) on Physical Address Extension (PAE) systems and the
ability to do 8-way SMP. It has support for TCP/IP load balancing and enhanced
two-node server clusters based on the Microsoft Cluster Server (MSCS) of Windows NT
Server 4.0 Enterprise Edition. A limited-edition 64-bit version of Windows 2000
Advanced Server was made available via the OEM channel. It also supports failover and
load balancing.
Windows 2000 Datacenter Server is a variant of Windows 2000 Server designed for
large businesses that move large quantities of confidential or sensitive data frequently
via a central server. As with Advanced Server, it supports clustering, failover and load
balancing. Its minimum system requirements are modest, but it scales to very powerful
hardware: a Pentium-class CPU at 400 MHz or higher, with up to 32 processors
supported in one machine; 256 MB of RAM, with up to 64 GB supported in one
machine; and approximately 1 GB of available disk space.
Deployment
Windows 2000 can be deployed to a site via various methods. It can be installed onto
servers via traditional media (such as CD) or from distribution folders that reside on a
shared folder. Installations can be attended or unattended. An attended installation
requires the manual intervention of an operator to choose options when installing the
operating system. Unattended installations are scripted via an answer file, or predefined
script in the form of an INI file that has all the options filled in already. The Winnt.exe or
Winnt32.exe program then uses that answer file to automate the installation. Unattended
installations can be performed via a bootable CD, using Microsoft Systems Management
Server (SMS), via the System Preparation Tool (Sysprep), via running the Winnt32.exe
program using the /syspart switch or via running the Remote Installation Service (RIS).
The Syspart method is started on a standardised reference computer – though the
hardware need not be similar – and it copies the required installation files from the
reference computer’s hard drive to the target computer’s hard drive. The hard drive does
not need to be in the target computer and may be swapped out to it at any time, with
hardware configuration still needing to be done later. The Winnt.exe program must also
be passed a /unattend switch that points to a valid answer file and a /s switch that points
to the location of one or more valid installation sources.
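An answer file is an ordinary INI-style text file. A minimal example might look like the following; the section names follow the standard unattend.txt layout, but every value here is invented for illustration:

```
; unattend.txt - illustrative values only
[Unattended]
UnattendMode = FullUnattended
TargetPath = \WINNT

[UserData]
FullName = "Example User"
OrgName = "Example Corp"
ComputerName = SRV01

[GuiUnattended]
AdminPassword = "secret"
TimeZone = 035
```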
Sysprep allows the duplication of a disk image on an existing Windows 2000 Server
installation to multiple servers. This means that all applications and system configuration
settings will be copied across to the new Windows 2000 installations, but it also means
that the reference and target computers must have the same HALs, ACPI support, and
mass storage devices – though Windows 2000 automatically detects plug and play
devices. The primary reason for using Sysprep is for deploying Windows 2000 to a site
that has standard hardware and that needs a fast method of installing Windows 2000 to
those computers. If a system has different HALs, mass storage devices or ACPI support,
then multiple images would need to be maintained.
Systems Management Server can be used to upgrade multiple systems to Windows
2000. The operating systems to be upgraded in this process must be
running a version of Windows that can be upgraded (Windows NT 3.51, Windows NT 4,
Windows 98 and Windows 95 OSR2.x) and those versions must be running the SMS
client agent that can receive software installation operations. Using SMS allows
installations to happen over a wide geographical area and provides centralised control
over upgrades to systems.
Remote Installation Services (RIS) is a means to automatically install Windows 2000
Professional (and not Windows 2000 Server) to a local computer over a network from a
central server. Images do not have to support specific hardware configurations and the
security settings can be configured after the computer reboots as the service generates a
new unique security ID (SID) for the machine. This is required so that local accounts are
given the right identifier and do not clash with other Windows 2000 Professional
computers on a network.
RIS requires that client computers are able to boot over the network, either via a
network interface card that has a Pre-Boot Execution Environment (PXE) boot ROM
installed or via a network card that is supported by the remote boot disk generator.
The remote computer must also meet the Net PC specification. The server that RIS runs
on must be Windows 2000 Server and the server must be able to access a network DNS
Service, a DHCP service and the Active Directory services.
Active Directory's initial release drew criticism from competitors. Novell, for example,
promoted its rival directory service with the claim: “NDS eDirectory is a cross-platform
directory solution that works on NT 4, Windows 2000 when available, Solaris and
NetWare 5. Active Directory will only support the Windows 2000 environment. In
addition, eDirectory users can be assured they are using the most trusted, reliable and
mature directory service to manage and control their e-business relationships – not a 1.0
release.”