
    CHAPTER ONE

    INTRODUCTION

    1.1 Scope of Research

Driver fatigue and the drowsiness associated with it are a significant factor in a large number of vehicle accidents. Recent statistics show that 1,200 deaths and 76,000 injuries can be attributed annually to fatigue-related crashes. This disturbing trend urgently calls for the development of early-warning systems that detect driver drowsiness at the wheel (Haro et al., 2000).

    The development of technologies for detecting or preventing drowsiness at the wheel

    has been a major challenge to the field of accident avoidance systems (Neeta, 2002). Because

    of the hazard that drowsiness presents on the road, methods need to be developed for

    counteracting these effects. The aim of this project is to improve on the development of

    drowsiness detection systems. The focus will be placed on designing a system that will

accurately monitor the open or closed state of the driver's eyes in real time. By monitoring the

    eyes, it is believed that the symptoms of driver fatigue can be detected early enough to avoid a

    car accident. Detection of fatigue involves a sequence of images of a face, and the observation

    of eye movements and blink patterns.

Eye-blink detection plays an important role in human-computer interface (HCI) systems. It can also be used in driver assistance systems. Studies show that eye-blink duration is closely related to a subject's drowsiness (Kojima et al., 2001). The openness of the eyes, as well as the frequency of eye blinks, indicates the person's level of consciousness, which has potential applications in monitoring a driver's vigilance for additional safety control. Eye blinks can also be used as a method of communication for people with severe disabilities, in which blink patterns are interpreted as semiotic messages. This provides an


alternate input modality for controlling a computer: communication by blink pattern. The duration of eye closure determines whether a blink is voluntary or involuntary. Blink patterns are used by interpreting voluntary long blinks according to a predefined semiotic dictionary, while ignoring involuntary short blinks (Black et al., 1997).

Eye-blink detection has attracted considerable research interest from the computer vision community. In the literature, most existing techniques use two separate steps for eye tracking and blink detection. For eye-blink detection systems, three types of dynamic information are involved: the global motion of the eye, the local motion of the eyelids, and eye openness/closure. Once the eyes' locations are estimated by the tracking algorithm, the differences in image appearance between open and closed eyes can be used to find the frames in which the subject's eyes are closed, so that eye blinking can be determined. Template matching is used to track the eyes, and color features are used to determine the openness of the eyes. Detected blinks are then used together with pose and gaze estimates to monitor the driver's alertness; differences in intensity values between the upper and lower parts of the eye are used for openness/closure classification, so that closed-eye frames can be detected. The use of low-level features makes real-time implementation of blink detection systems feasible. However, for videos with large variations, such as typical videos collected from in-car cameras, the acquired images are usually noisy and of low resolution. In such scenarios, simple low-level features, like color and image differences, are not sufficient; temporal information is also used by some researchers for blink detection (Grauman et al., 2003).
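As an illustration of the low-level pipeline described above (template matching for eye localization and upper/lower intensity differences for openness classification), a minimal sketch in Python using OpenCV and NumPy is given below. These libraries, the function names and the threshold value are illustrative assumptions, not the system developed in this thesis.

```python
# Minimal sketch: locate the eye by template matching, then classify open/closed
# from the intensity difference between the upper and lower halves of the eye
# region. The threshold value is illustrative.
import cv2
import numpy as np

def locate_eye(frame_gray, eye_template):
    """Return the top-left corner (x, y) of the best template match."""
    result = cv2.matchTemplate(frame_gray, eye_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)
    return max_loc

def eye_is_closed(frame_gray, top_left, template_shape, diff_threshold=10.0):
    """Crude openness test: an open eye shows a darker upper half (iris, lashes)
    than the lower half; a closed eye shows little upper/lower difference."""
    h, w = template_shape
    x, y = top_left
    region = frame_gray[y:y + h, x:x + w].astype(np.float32)
    upper, lower = region[: h // 2, :], region[h // 2 :, :]
    return abs(upper.mean() - lower.mean()) < diff_threshold
```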


    1.2 Justification of Work

Eye blink is the physiological activity of rapid closing and opening of the eyelids, an essential function that helps spread tears across, and remove irritants from, the surface of the cornea and conjunctiva (Tsubota, 1998). Although blink speed can vary with factors such as fatigue, emotional stress, behavior category, amount of sleep, eye injury, medication, and disease, researchers report that the spontaneous resting blink rate of a human being is approximately 15 to 30 blinks per minute (Karson, 1983). That is, a person blinks roughly once every 2 to 4 seconds, and a blink lasts on average 250 milliseconds.

Currently a generic camera can easily capture face video at no less than 15 fps (frames per second), i.e. the frame interval is not more than 70 milliseconds. Thus, it is easy for a generic camera to capture two or more frames for each blink when a face looks into the camera. The advantages of the eye-blink-based approach are that it is non-intrusive, it can generally be used without user collaboration, and no extra hardware is required. Eye-blink behavior is the prominent characteristic that distinguishes a live face from a facial photograph in images from a generic camera.
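A quick arithmetic check of the figures quoted above (a 250 ms blink captured at 15 fps), written as a short Python snippet for concreteness:

```python
# At 15 fps the frame interval is about 67 ms, so an average 250 ms blink
# spans several consecutive frames, as claimed above.
fps = 15
blink_duration_s = 0.250          # average blink length quoted above
frame_interval_ms = 1000.0 / fps  # ~66.7 ms, i.e. "not more than 70 ms"
frames_per_blink = blink_duration_s * fps

print(f"frame interval: {frame_interval_ms:.1f} ms")
print(f"frames captured per blink: {frames_per_blink:.2f}")  # about 3.75 frames
```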

    1.3 Requirements

The system for tracking the eyes should be robust, non-intrusive and inexpensive, which is quite a challenge in the computer vision field. Eye tracking currently receives a great deal of attention for applications such as facial expression analysis and driver awareness systems. Very accurate eye trackers that rely on external devices already exist. Most modern eye trackers use contrast to locate the centre of the pupil, use infrared cameras to create a corneal reflection, and triangulate the two to determine the fixation point.


    However, eye tracking setups vary greatly; some are head-mounted, some require the

    head to be stable (for example, with a chin rest), and some automatically track the head as

    well. The eye-tracker described in this thesis is characterized for being a noninvasive eye-

    tracker. This is because we do not need any external devices for tracking the eyes besides the

    web camera, which records the video stream. Moreover, the efficiency of the eye-tracker is

    very important when working with real time communications.

    1.4 Objectives of the Research

The specific objectives of the study are to:

(a) develop an algorithm to identify and track the location of the driver's eye;

(b) develop an algorithm for an eye-blink detection system;

(c) design a system that implements (a) and (b); and

(d) evaluate the performance of the system in (c).

    1.5 Thesis Organisation

The remaining part of this thesis is organized as follows. Chapter Two discusses the major image processing techniques used in the design of related systems and surveys related work. The development of the algorithm is presented in Chapter Three, while the test experiments conducted and their results are the contents of Chapter Four. Chapter Five concludes the report and indicates some directions for further work.


    CHAPTER TWO

    LITERATURE REVIEW

    2.1 Human Eye and Its Behavior

    A close-up view of a typical open human eye is shown in Fig. 2.1. The most

    significant feature in the eye is the iris. It has a ring structure with a large variety of colours.

    The ring might not be completely visible even if the eye is in its normal state (non-closed or

    partly closed). Visibility depends on the individual variations (Uzunova, 2005). Most often, it

is partly occluded above by the upper eyelid; being completely visible, or occluded by both eyelids, is also possible. The iris also changes its position, from centered to rolled to one side, or rolled upwards or downwards. Depending on its speed, the motion of the iris from side to side is called smooth pursuit or a saccade. A saccade is a rapid iris movement, which happens when fixation jumps from one point to another (Galley et al., 2004).

Inside the iris is the pupil, a smaller dark circle whose size varies depending on the lighting conditions. The sclera is the white visible portion of the eyeball; to the unaided eye it is the brightest part of the eye region, and it directly surrounds the iris. Apart from these features, the eye has two additional salient features, the upper and lower eyelids. Their Latin names are palpebra superior (the upper eyelid) and palpebra inferior (the lower). The gap between them is called the rima palpebrarum.

The eyelids' movements are constrained by their physical attributes. The upper eyelid is a stretchable skin membrane that can cover the eye. It has great freedom of motion, ranging from wide open to closed, with small deformations due to eyeball motion. When the eye is open, the eyelid is a concave arc connecting the two eye corners. As the eye becomes


    Fig. 2.1: The Human eye (Uzunova, 2005).


more and more closed, the curvature of the arc becomes lower; it takes a line-like shape when the eye is nearly closed and follows the lower eyelid when the eye is closed. The lower eyelid, on the other hand, is close to a straight line and moves to a smaller degree. In this thesis the eyelid contours will be referred to simply as eyelids, unless stated otherwise. The eyelids meet each other at the eye corners (angulus oculi). The eye corners will be referred to as inner corners (those closer to the nose) and outer corners, and they are called left or right corners according to how they appear in the image. The skin-colored growth close to the inner corner is a third, degenerated membrane, called the membrana nictitans.

The eye features - the iris and the eyelids - can be involved in very complex movements as part of overall human behaviour, expressing different meanings. Here only the eye features and local movements within the eye region are described. The eye blink is the focus of this thesis. It is a natural act consisting of a closing of the eye followed by an opening, in which the upper eyelid performs most of the movement. Similar to blinking is eyelid fluttering, a quick wavering or flapping motion of the upper eyelid. Here blinking and eyelid fluttering are not distinguished, but blinking and eye closing are not synonyms, especially in the context of a safe-driving system. To distinguish eye closing from blinking, time has to be taken into account.

Blinking can be defined as a temporary hiding of the iris due to the touching of both eyelids within one second, whereas closing takes a longer time. According to researchers (Thorslund, 2003), blinking frequency is affected by different factors such as mood state, task demand, etc. In a stress-free state the blink rate is 15-20 times per minute. It drops to about 3 times per minute during reading, and it increases under stress, time pressure or when close attention is required. The pattern for detecting drowsiness can be described as follows. In


the awake state, the eyelids are far apart before they close, they remain closed for only a short interval, and closing of the eye (a single blink) is repeated rarely. As the person gets tired, the eyelids stay closer to each other, the time during which the eye is closed increases, and the frequency of blinking increases as well; in other words, drowsiness is characterized by long, flat blinks (Galley et al., 2004).

    2.2 Image Representation and Acquisition

Any visual scene can be represented by a continuous function (in two dimensions) of some analogue quantity. This is typically the reflectance function of the scene: the light reflected at each visible point in the scene. Such a representation is referred to as an image, and the value at any point in the image corresponds to the intensity of the reflectance function at that point.

A continuous analogue representation cannot be conveniently interpreted by a computer, and an alternative representation, the digital image, must be used. Digital images also represent the reflectance function of a scene, but they do so in a sampled and quantized form (David, 1991). Fig. 2.2 shows a block diagram depicting the processing steps in computer vision. The basic image acquisition equipment used in this study is the camera.

There are two types of semiconductor photosensitive sensor used in cameras: CCD (charge-coupled device) and CMOS (complementary metal oxide semiconductor). In a CCD sensor, every pixel's charge is transferred through just one output node to be converted to voltage, buffered and sent off-chip as an analogue signal, so all of the pixel area can be devoted to light capture. In a CMOS sensor each pixel has its own charge-to-voltage conversion, and the


Fig. 2.2: Block diagram depicting processing steps in computer vision (Zuechi, 2000). The blocks in the diagram are: image acquisition, enhancement (preprocessing), segmentation, coding (feature extraction), image analysis, and decision making.


sensor often includes amplifiers, noise correction and digitization circuits, so that the chip outputs digital bits (Sonka et al., 2008). These additional functions increase the design complexity and reduce the area available for light capture, but the chip can be built to require less off-chip circuitry for basic operation.

The development of semiconductor technology permits the production of matrix-like sensors based on CMOS technology. This technology is used in mass production in the semiconductor industry; because processors and memories are manufactured using the same technology, the photosensitive matrix-like element can be integrated on the same chip as the processor and/or operational memory. This opens the door to 'smart cameras', in which image capture and basic image processing are performed on the same chip.

The major advantages of CMOS cameras (as opposed to CCD) are a higher range of sensed intensities (about 4 orders of magnitude), a high read-out speed (about 100 ns) and random access to individual pixels. The basic CCD element includes a Schottky photodiode and a field-effect transistor. A photon falling on the junction of the photodiode liberates electrons from the crystal lattice and creates holes, resulting in an electric charge that accumulates in a capacitor. The collected charge is directly proportional to the light intensity and to the duration of its falling on the diode.

The sensor elements are arranged into a matrix-like grid of pixels on a CCD chip. The charges accumulated by the sensor elements are transferred to a horizontal register one row at a time by a vertical shift register, and are then shifted out in bucket-brigade fashion to form the video signal.

There are three inherent problems with CCD chips. First, the blooming effect is the mutual influence of charge in neighboring pixels; current CCD sensor technology is able to suppress this problem (anti-blooming) to a great degree. Second, it is impossible to address individual pixels in a CCD chip directly, because read-out through the shift registers is needed. Third, individual CCD sensor elements are able to accumulate only approximately 30-200 thousand electrons, while the usual level of inherent noise of the CCD sensor is of the order of 20 electrons.

The signal-to-noise ratio (SNR) in the case of a CCD chip is therefore

SNR = 200,000 / 20 = 10^4                                              (2.1)

This implies that the logarithmic noise level is approximately 80 dB at best, so the CCD sensor is able to cope with four orders of magnitude of intensity in the best case. This range drops to approximately two orders of magnitude with common uncooled CCD cameras, while the range of incoming light intensity variations is usually higher.

    2.3 Image pre-processing

Pre-processing is the name used for operations on images at the lowest level of abstraction; both input and output are intensity images (Sonka et al., 2008). These iconic images are usually of the same kind as the original data captured by the sensor, with an intensity image usually represented by a matrix or matrices of image function values (brightness).

    Pre-processing does not increase image information content. If information is

    measured using entropy, then pre-processing typically decreases image information content.

    From the information-theoretic viewpoint it can thus be concluded that the best pre-

    processing is no pre-processing and without question, the best way to avoid (elaborate) pre-

    processing is to concentrate on high- quality image acquisition. Nevertheless, pre-processing

    is very useful in a variety of situations since it helps to suppress information that is not

    relevant to the specific image processing or analysis task. Therefore, the aim of pre-


    processing is an improvement of the image data that suppresses undesired distortions or

    enhances some image features important for further processing.

    A considerable redundancy of information in most images allows image pre-

    processing methods to explore image data itself to learn image characteristics in a statistical

    sense. These characteristics are used either to suppress unintended degradations such as noise

    or to enhance the image. Neighboring pixels corresponding to a given object in real images

    have essentially the same or similar brightness value, so if a distorted pixel can be picked out

    from the image, it can usually be restored as an average of neighboring pixels.

Image pre-processing methods are classified into three categories, namely pixel brightness transformations, geometric transformations, and local pre-processing. These methods are discussed in detail in the following sections.

    2.3.1 Pixel brightness transformations

A brightness transformation modifies pixel brightness; the transformation depends on the properties of the pixel itself. There are two classes of pixel brightness transformations: brightness corrections and gray-scale transformations. Brightness correction modifies the pixel brightness taking into account its original brightness and its position in the image, while gray-scale transformation changes brightness without regard to position in the image.

    2.3.1.1 Position dependent brightness correction

    Ideally, the sensitivity of image acquisition and digitization devices should not

    depend on position in the image, but this assumption is not valid in many practical cases. The

    lens attenuates light more if it passes farther from the optical axis, and the photosensitive part


    of the sensor (vacuum-tube camera, CCD camera elements) is not of identical sensitivity.

    Uneven object illumination is also a source of degradation.

    If degradation is of a systematic nature, it can be suppressed by brightness correction.

A multiplicative error coefficient e(i, j) describes the change from the ideal identity transfer function. Assume that g(i, j) is the original undegraded (desired, or true) image and f(i, j) is the image containing degradation. Then

f(i, j) = e(i, j) g(i, j)                                              (2.2)

The error coefficient e(i, j) can be obtained if a reference image g(i, j) with known brightness is captured, the simplest being an image of constant brightness c. The degraded result is the image f_c(i, j). The systematic brightness errors can then be suppressed as

g(i, j) = f(i, j) / e(i, j) = c f(i, j) / f_c(i, j)                    (2.3)

This method can be used only if the image degradation process is stable. If we wish to suppress this kind of error in the image capturing process, we should perhaps re-calibrate the device (find the error coefficients) from time to time.

The brightness correction method implicitly assumes linearity of the transformation, which is not true in reality because the brightness scale is limited to some interval. The calculation according to equation 2.3 can overflow, and the limits of the brightness scale are then used instead; this implies that the best reference image has a brightness that is far enough from both limits. If the gray-scale has 256 brightness levels, the ideal reference image has a constant brightness value of 128.
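A minimal sketch of the correction of equation (2.3) in Python/NumPy is given below; the function name and the clipping to an 8-bit brightness scale are illustrative assumptions.

```python
# Sketch of the systematic brightness correction of equation (2.3): the degraded
# image g is corrected using a reference image f_c captured from a scene of
# known constant brightness c.
import numpy as np

def correct_brightness(degraded, reference, c=128.0):
    """f(i,j) is the degraded image, f_c(i,j) the reference of constant brightness c.
    Returns c * f(i,j) / f_c(i,j), clipped to the limits of the brightness scale."""
    corrected = c * degraded.astype(np.float32) / np.maximum(reference.astype(np.float32), 1e-6)
    return np.clip(corrected, 0, 255).astype(np.uint8)
```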


2.3.1.2 Gray-scale transformation

Gray-scale transformations do not depend on the position of the pixel in the image. A transformation T of the original brightness p from the scale [p_0, p_k] into a brightness q from a new scale [q_0, q_k] is given by

q = T(p)                                                               (2.4)

The most common gray-scale transformations are shown in Fig. 2.3(a); the piecewise linear function a enhances the image contrast between brightness values p_1 and p_2, the function b is called brightness thresholding and results in a black-and-white image, and the straight line c denotes the negative transformation. Digital images have a very limited number of gray-levels, so gray-scale transformations are easy to realize both in hardware and software. Often only 256 bytes of memory (called a look-up table) are needed.

The original brightness is the index into the look-up table, and the table content gives the

    new brightness. The image signal usually passes through a look-up table in the image

    displays, enabling simple gray-scale transformations in real time. The same principle can be

    used for color displays. A color signal consists of three components: red, green and blue;

    three look-up tables provide all possible color scale transformations.
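The following short Python/NumPy sketch illustrates a gray-scale transformation realised through a 256-entry look-up table, here instantiated with the negative transformation; the names are illustrative.

```python
# A gray-scale transformation realised through a 256-entry look-up table: the
# original brightness indexes the table, and the table content is the new
# brightness. Here the table implements the negative transformation q = 255 - p.
import numpy as np

lut = np.arange(256, dtype=np.uint8)[::-1]   # negative transformation

def apply_grayscale_transform(image, lut):
    """image: uint8 grayscale array; lut: 256-entry table of new brightnesses."""
    return lut[image]
```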

    Gray-scale transformations are used mainly when an image is viewed by a human

    observer, and a transformed image might be more easily interpreted if the contrast is

    enhanced. For instance an X-ray image can often be much clearer after transformation. A

    gray-scale transformation for contrast enhancement is usually found automatically using the

    histogram equalization technique. The aim is to create an image with equally distributed

    brightness levels over the whole brightness scale as in Fig. 2.3(b). Histogram equalization

enhances contrast for brightness values close to histogram maxima, and decreases contrast

    near minima.


(a): The most common gray-scale transformations

(b): Histogram equalization of images

Fig. 2.3: Gray-scale transformations and histogram equalization of images (Sonka et al., 2008)


Denote the input histogram by H(p) and recall that the input gray-scale is [p_0, p_k]. The intention is to find a monotonic pixel brightness transformation q = T(p) such that the desired output histogram G(q) is uniform over the whole output brightness scale [q_0, q_k]. The histogram can be treated as a discrete probability density function. The monotonic property of the transform implies

Σ_{i=0}^{k} G(q_i) = Σ_{i=0}^{k} H(p_i)                                (2.5)

The sums in equation 2.5 can be interpreted as discrete distribution functions. Assume that the image has N rows and M columns; then the equalized histogram corresponds to the uniform probability density function whose function value is a constant:

f = N M / (q_k - q_0)                                                  (2.6)

The value from equation (2.6) replaces the left side of equation (2.5). The equalized histogram can be obtained precisely only for the idealized continuous probability density, in which case equation 2.5 becomes

N M (q - q_0) / (q_k - q_0) = ∫_{p_0}^{p} H(s) ds                      (2.7)

The desired pixel brightness transformation can then be derived as

q = T(p) = q_0 + ( (q_k - q_0) / (N M) ) ∫_{p_0}^{p} H(s) ds           (2.8)

The integral in equation (2.8) is called the cumulative histogram, which is approximated by a sum in digital images, so the resulting histogram is not equalized ideally. The discrete approximation of the continuous pixel brightness transformation from equation 2.8 is

q = T(p) = q_0 + ( (q_k - q_0) / (N M) ) Σ_{i=p_0}^{p} H(i)            (2.9)
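The discrete transformation of equation (2.9) can be implemented directly from the cumulative histogram. A minimal Python/NumPy sketch, assuming an 8-bit grayscale image, is shown below.

```python
# Discrete histogram equalization following equation (2.9): the new brightness is
# read from a look-up table built from the scaled cumulative histogram.
import numpy as np

def equalize_histogram(image):
    """image: uint8 grayscale array (8-bit, gray-levels 0..255)."""
    hist = np.bincount(image.ravel(), minlength=256)   # H(i)
    cum_hist = np.cumsum(hist)                         # cumulative histogram
    n_pixels = image.size                              # N * M
    q0, qk = 0, 255                                    # output brightness scale
    lut = (q0 + (qk - q0) / n_pixels * cum_hist).astype(np.uint8)
    return lut[image]
```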


2.3.2 Geometric transformations

    Geometric transforms are common in computer graphics and are often used in image

    analysis as well. They permit elimination of the geometric distortion that occurs when an

    image is captured. If one attempts to match two different images of the same object, a

    geometric transformation may be needed. We consider geometric transformations only in 2D

    as this is sufficient for most digital images. One example is an attempt to match remotely

    sensed images of the same area taken after a year, when the more recent image was probably

not taken from precisely the same position. To inspect changes over the year, it is necessary first to execute a geometric transformation and then subtract one image from the other. A geometric transform is a vector function T that maps the pixel (x, y) to a new position (x', y'); an illustration of a whole region transformed on a point-to-point basis is shown in Fig. 2.4. T is defined by its two component equations:

x' = T_x(x, y),   y' = T_y(x, y)                                       (2.10)

The transformation equations T_x and T_y are either known in advance (for example, in the case of rotation, translation or scaling) or can be determined from the original and transformed images, where several pixels with known correspondences in both images are used to derive the transformation. A geometric transform consists of two basic steps. The first is the pixel co-ordinate transformation, which maps the co-ordinates of the input image pixel to a point in the output image. The output point co-ordinates should be computed as continuous values (real numbers), as the position does not necessarily match the digital grid after the transform. The second step is to find the point in the digital raster which matches the transformed point and to determine its brightness value.


    Fig. 2.4: Geometric transform on a plane of images (Sonka et al., 2008).


    The brightness is usually computed as an interpolation of the brightness of several

    points in the neighborhood. This idea enables the classification of geometric transforms

    among other preprocessing techniques, the criterion being that only the neighborhood of a

    processed pixel is needed for the calculation. Geometric transforms are on the boundary

    between point and local operations.

    2.3.2.1 Pixel co-ordinate transformations

    Equation (2.10) shows the general case of finding the co-ordinates of a point in the

    output image after a geometric transform. It is usually approximated by a polynomial

    equation.

x' = Σ_{r=0}^{m} Σ_{k=0}^{m-r} a_rk x^r y^k,   y' = Σ_{r=0}^{m} Σ_{k=0}^{m-r} b_rk x^r y^k     (2.11)

This transform is linear with respect to the coefficients a_rk, b_rk, and so if pairs of corresponding points (x, y), (x', y') in both images are known, it is possible to determine a_rk, b_rk by solving a set of linear equations. More points than coefficients are usually used to provide robustness; the mean square method is often used.

    In the case where the geometric transform does not change rapidly depending on

    position in the image, low-order approximating polynomials, m=2 or m=3 are used, needing

    at least 6 or 10 pairs of corresponding points. The corresponding points should be distributed

in the image in a way that can express the geometric transformation; usually they are spread uniformly. In general, the higher the degree of the approximating polynomial, the more sensitive the geometric transform is to the distribution of the pairs of corresponding points.

    Equation (2.10) is in practice approximated by a bilinear transform for which four pairs of

    corresponding points are sufficient to find the transformation coefficients.


x' = a_0 + a_1 x + a_2 y + a_3 x y,   y' = b_0 + b_1 x + b_2 y + b_3 x y           (2.12)

Even simpler is the affine transformation, for which three pairs of corresponding points are sufficient to find the coefficients:

x' = a_0 + a_1 x + a_2 y,   y' = b_0 + b_1 x + b_2 y                               (2.13)

The affine transformation includes typical geometric transformations such as rotation, translation, scaling, and skewing. A geometric transform applied to the whole image may change the co-ordinate system, and a Jacobian J provides information about how the co-ordinate system changes:

J = det( [ ∂x'/∂x  ∂x'/∂y ; ∂y'/∂x  ∂y'/∂y ] )                                     (2.14)

If the transformation is singular (has no inverse), then J = 0. If the area of the image is invariant under the transformation, then J = 1. The Jacobian for the bilinear transform of equation 2.12 is

J = a_1 b_2 - a_2 b_1 + (a_1 b_3 - a_3 b_1) x + (a_3 b_2 - a_2 b_3) y              (2.15)

and for the affine transformation of equation 2.13 it is

J = a_1 b_2 - a_2 b_1                                                              (2.16)

Some important geometric transformations are the following. Rotation by the angle φ about the origin:


x' = x cos φ + y sin φ,   y' = -x sin φ + y cos φ,   J = 1                         (2.17)

Change of scale, a in the x axis and b in the y axis:

x' = a x,   y' = b y,   J = a b                                                    (2.18)

Skewing by the angle φ:

x' = x + y tan φ,   y' = y,   J = 1                                                (2.19)

    It is possible to approximate complex geometric transformations (distortion) by partitioning

    an image into smaller rectangular sub-images; for each sub-image, a simple geometric

transformation, such as the affine, is estimated using pairs of corresponding pixels. The

    geometric transformation (distortion) is then repaired separately in each sub-image.
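As a sketch of how an affine transform (equation 2.13) can be estimated from pairs of corresponding points by the mean-square method, the following Python/NumPy fragment solves for the coefficients by least squares; the function name and point lists are illustrative.

```python
# Estimate the affine coefficients of equation (2.13) from corresponding points
# by least squares; at least three point pairs are needed.
import numpy as np

def fit_affine(src_pts, dst_pts):
    """src_pts, dst_pts: (n, 2) arrays of corresponding (x, y) points, n >= 3.
    Returns (a0, a1, a2) and (b0, b1, b2) of
    x' = a0 + a1*x + a2*y,  y' = b0 + b1*x + b2*y."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    A = np.column_stack([np.ones(len(src)), src[:, 0], src[:, 1]])
    a, _, _, _ = np.linalg.lstsq(A, dst[:, 0], rcond=None)  # mean-square solution
    b, _, _, _ = np.linalg.lstsq(A, dst[:, 1], rcond=None)
    return a, b
```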

    There are some typical geometric distortions which have to be overcome in remote sensing.

Errors may be caused by distortion of the optical systems, by non-linearity in row-by-row scanning and a non-constant sampling period, or by wrong position or orientation of the sensor, leading to skew and line non-linearity distortions. Panoramic distortion (Fig. 2.5b) appears in line scanners with the

    mirror rotating at constant speed. Line non-linearity distortion (Fig. 2.5a) is caused by

    variable distance of the object from the scanner mirror. The rotation of the earth during

    image capture in a mechanical scanner generates skew distortion (Fig. 2.5c). Change of

distance from the sensor induces change-of-scale distortion (Fig. 2.5e). Perspective

    progression causes perspective distortion (Fig. 2.5f).


    Fig. 2.5: Geometric distortion types in images (Sonka et al., 2008).

    (a)Line non-linear distortion (b) panoramic distortion (c) Skew distortion

    (d) Paranormal distortion (e) Change of scale distortion (f) perspective distortion


    2.3.2.2 Brightness interpolation

Brightness interpolation influences image quality. The simpler the interpolation, the greater the loss in geometric and photometric accuracy, but the interpolation neighborhood is often kept reasonably small due to the computational load. The three most common interpolation methods are nearest-neighbor, linear, and bi-cubic.

The brightness interpolation problem is usually expressed in a dual way, by determining the brightness of the original point in the input image that corresponds to a point in the output image lying on the discrete raster. Assume that we wish to compute the brightness value of the pixel (x', y') in the output image, where x' and y' lie on the discrete raster (integer numbers, illustrated by solid lines in Fig. 2.6). The co-ordinates of the point (x, y) in the original image can be obtained by inverting the planar transformation of equation (2.10):

(x, y) = T^{-1}(x', y')                                                            (2.20)

In general, the real co-ordinates after the inverse transformation (dashed lines in Fig. 2.6) do not fit the input image discrete raster (solid lines), and so the brightness is not known. The only information available about the original continuous image f(x, y) is its sampled version g_s(l Δx, k Δy). The interpolated brightness f_n(x, y) can be expressed by the convolution equation

f_n(x, y) = Σ_l Σ_k g_s(l Δx, k Δy) h_n(x - l Δx, y - k Δy)                        (2.21)

The function h_n is called the interpolation kernel. Usually a small neighborhood is used, outside which h_n is zero (Sonka et al., 2008).

Nearest-neighborhood interpolation assigns to the point (x, y) the brightness value of the nearest point g_s in the discrete raster, as shown in Fig. 2.6(a). On the right side is the interpolation kernel in the 1D case. The left side of the figure shows how the new brightness is assigned: dashed lines show how the inverse planar transformation maps the raster


    (a): Nearest neighborhood interpolation

    (b): Linear interpolation

    Fig. 2.6: Interpolation types in images (Sonka et al., 2008).


    of the output image; full lines show the raster of the input image. Nearest-neighborhood

    interpolation is given by

f_1(x, y) = g_s( round(x), round(y) )                                              (2.22)

    The position error of the nearest-neighborhood interpolation is at most half a pixel.

    This error is perceptible on objects with straight-line boundaries that may appear step-like

    after the transformation.

Linear interpolation explores the four points neighboring the point (x, y) and assumes that the brightness function is linear in this neighborhood. Linear interpolation is demonstrated in Fig. 2.6(b) and is given by

f_2(x, y) = (1 - a)(1 - b) g_s(l, k) + a (1 - b) g_s(l + 1, k) + (1 - a) b g_s(l, k + 1) + a b g_s(l + 1, k + 1)     (2.23)

where l = floor(x), a = x - l, k = floor(y), b = y - k. Linear interpolation can cause a small decrease in resolution, and blurring due to its averaging nature, but the problem of step-like boundaries seen with nearest-neighborhood interpolation is reduced.
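A minimal Python/NumPy sketch of linear (bilinear) brightness interpolation at a non-integer point is given below; it assumes the point lies strictly inside the image raster.

```python
# Linear (bilinear) interpolation: the brightness at a non-integer point (x, y)
# is computed from its four raster neighbours, as in equation (2.23).
import numpy as np

def bilinear_interpolate(image, x, y):
    """Assumes (x, y) lies strictly inside the image (no border handling)."""
    l, k = int(np.floor(x)), int(np.floor(y))
    a, b = x - l, y - k                       # fractional offsets within the cell
    g = image.astype(np.float32)
    return ((1 - a) * (1 - b) * g[k, l] +
            a * (1 - b) * g[k, l + 1] +
            (1 - a) * b * g[k + 1, l] +
            a * b * g[k + 1, l + 1])
```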

Bi-cubic interpolation improves the model of the brightness function by approximating it locally by a bi-cubic polynomial surface; 16 neighboring points are used for the interpolation. The one-dimensional interpolation kernel ('Mexican hat') is shown in Fig. 2.7 and is given by

h_3(x) = 1 - 2|x|^2 + |x|^3            for 0 <= |x| < 1
h_3(x) = 4 - 8|x| + 5|x|^2 - |x|^3     for 1 <= |x| < 2
h_3(x) = 0                             otherwise                                   (2.24)


    Fig. 2.7: Bi-cubic interpolation kernel (Sonka et al., 2008).


Bi-cubic interpolation is often used in raster displays that

    enable zooming with respect to an arbitrary point. If the nearest-neighborhood method were

    used, areas of the same brightness would increase. Bi-cubic interpolation preserves fine

    details in the image very well.

    2.3.3 Local pre-processing

    This method uses a small neighborhood of a pixel in an input image to produce a new

    brightness value in the output image. Such preprocessing operations are called filtration (or

    filtering) if signal processing terminology is used. Local pre-processing methods can be

divided into two groups according to the goal of the processing: smoothing and edge detection. Smoothing aims to suppress noise or other small fluctuations in the image; it is equivalent to the suppression of high frequencies in the Fourier transform domain. Unfortunately, smoothing also blurs the sharp edges that bear important information about the image.

    2.3.3.1 Image smoothing

Image smoothing is the set of local pre-processing methods whose predominant use is the suppression of image noise; it exploits the redundancy in the image data. Calculation of the new value is based on averaging the brightness values in some neighborhood. Smoothing poses the problem of blurring sharp edges in the image, and so attention is given to smoothing methods which are edge preserving.


    Local image smoothing can effectively eliminate noise or degradation appearing as thin

    stripes, but does not work if degradations are large blobs or thick stripes (Sonka et al., 2008).

    2.3.3.2 Median filtering

In probability theory, the median divides the higher half of a probability distribution from the lower half. For a random variable x, the median M is the value for which the probability of the outcome x < M is 1/2. For a finite set of values, the median is found by ordering the values and taking the middle one. In median filtering, the brightness of the current pixel is replaced by the median of the brightness values in its local neighborhood, which suppresses impulse noise while preserving edges better than averaging does.


    Fig. 2.8: Horizontal/vertical line preserving neighborhood for median filtering (Sonka et al.,

    2008).
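A minimal sketch of median filtering over a square 3x3 neighborhood, written in Python/NumPy for illustration (border pixels are left unchanged for simplicity):

```python
# Median filter over a 3x3 neighbourhood: each output pixel takes the median of
# its neighbours, suppressing impulse noise while preserving edges better than
# simple averaging. Border pixels are left unchanged for simplicity.
import numpy as np

def median_filter_3x3(image):
    out = image.copy()
    for i in range(1, image.shape[0] - 1):
        for j in range(1, image.shape[1] - 1):
            out[i, j] = int(np.median(image[i - 1:i + 2, j - 1:j + 2]))
    return out
```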


    2.3.3.4 Non-linear mean filter

The non-linear mean filter is another generalization of averaging techniques (Pitas and Venetsanopoulos, 1986); it is defined by

f(m, n) = u^{-1} ( Σ_{(i,j) ∈ O} a(i, j) u( g(i, j) ) / Σ_{(i,j) ∈ O} a(i, j) )    (2.25)

where f(m, n) is the result of the filtering, g(i, j) is the pixel in the input image, and O is a local neighborhood of the current pixel. The function u of one variable has an inverse function u^{-1}; the a(i, j) are weight coefficients. If the weights a(i, j) are constant, the filter is called homomorphic.
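As an illustration of equation (2.25), the following Python/NumPy sketch uses constant weights and u(g) = log(g), giving a homomorphic (geometric-mean-like) filter over a 3x3 neighborhood; the function name is illustrative.

```python
# Non-linear mean filter of equation (2.25) with constant weights and
# u(g) = log(g), i.e. u^{-1}(mean(u(g))) over a 3x3 neighbourhood.
# Interior pixels only, for brevity.
import numpy as np

def nonlinear_mean_filter(image):
    g = image.astype(np.float64) + 1.0        # shift to avoid log(0)
    out = image.astype(np.float64).copy()
    for i in range(1, g.shape[0] - 1):
        for j in range(1, g.shape[1] - 1):
            neigh = g[i - 1:i + 2, j - 1:j + 2]
            out[i, j] = np.exp(np.mean(np.log(neigh))) - 1.0
    return out
```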

2.3.3.5 Edge detectors

    Edge detectors are a collection of very important local image pre-processing methods

    used to locate changes in the intensity function; edges are pixels where this function

    (brightness) changes abruptly. Edges are to a certain degree invariant to changes of

    illumination and viewpoint.

    If only edge elements with strong magnitude (edgels) are considered, such information often

    suffices for image understanding. The positive effect of such a process is that it leads to

    significant reduction of image data. Nevertheless such a data reduction does not undermine

    understanding the content of the image (interpretation) in many cases.

    An edge is a property attached to an individual pixel and is calculated from the image

    function behavior in a neighborhood of that pixel. It is a vector variable with two

components, magnitude and direction. The edge magnitude is the magnitude of the gradient, and the edge direction φ is rotated with respect to the gradient direction ψ by -90°. The gradient direction gives the direction of maximum growth of the image function, e.g. from black f(i, j) = 0 to


white f(i, j) = 255. This is illustrated in Fig. 2.9(a), in which closed lines are lines of equal brightness. The orientation 0° points east.

    Edges are often used in image analysis for finding region boundaries. Provided that

    the region has homogeneous brightness, its boundary is at the pixels where the image

    function varies and so in the ideal case without noise consists of pixels with high edge

    magnitude. It can be seen that the boundary and its parts (edges) are perpendicular to the

    direction of the gradient.

The edge profile in the gradient direction (perpendicular to the edge direction) is typical for edges, and Fig. 2.9(b) shows examples of several standard profiles. Roof edges are typical for objects corresponding to thin lines in the image. Edge detectors are usually tuned for some type of edge profile.

The gradient magnitude |grad g(x, y)| and gradient direction ψ are continuous image functions calculated as

|grad g(x, y)| = sqrt( (∂g/∂x)^2 + (∂g/∂y)^2 )                                     (2.26)

ψ = arg( ∂g/∂x, ∂g/∂y )                                                            (2.27)

where arg(x, y) is the angle (in radians) from the x axis to the point (x, y). Sometimes we are interested only in edge magnitudes without regard to their orientations; a linear differential operator called the Laplacian may then be used. The Laplacian has the same properties in all directions and is therefore invariant to rotation in the image. It is defined as

∇²g(x, y) = ∂²g(x, y)/∂x² + ∂²g(x, y)/∂y²                                          (2.28)


    (a) Gradient direction and edge direction

    (b) Typical edge profile

    Fig. 2.9 Diagrams illustrating edge detection (Sonka et al., 2008).
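A minimal Python/NumPy sketch of equations (2.26) and (2.27), with the derivatives approximated by first differences (n = 1), is given below for illustration.

```python
# Gradient magnitude and direction of equations (2.26)-(2.27), with derivatives
# approximated by first differences (n = 1).
import numpy as np

def gradient_magnitude_direction(image):
    g = image.astype(np.float32)
    dx = np.zeros_like(g)
    dy = np.zeros_like(g)
    dx[:, 1:] = g[:, 1:] - g[:, :-1]          # horizontal first difference
    dy[1:, :] = g[1:, :] - g[:-1, :]          # vertical first difference
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    direction = np.arctan2(dy, dx)            # angle from the x axis, in radians
    return magnitude, direction
```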


Image sharpening has the objective of making edges steeper; the sharpened image is intended to be observed by a human. The sharpened output f is obtained from the input image g as

f(i, j) = g(i, j) - C S(i, j)                                                      (2.29)

where C is a positive coefficient which gives the strength of the sharpening and S(i, j) is a measure of the image function sheerness, calculated using a gradient operator. The Laplacian is often used for this purpose.

    Image sharpening can be interpreted in the frequency domain. The result of the

Fourier transform is a combination of harmonic functions. The derivative of the harmonic function sin(nx) is n cos(nx); thus the higher the frequency,

    the higher the magnitude of its derivative. This explains why gradient operators are used to

    enhance edges.

A technique similar to the image sharpening of equation (2.29), called unsharp masking, is often used in printing industry applications. A signal proportional to an unsharp image

    (heavily blurred by a smoothing operator) is subtracted from the original image. A digital

    image is discrete in nature and so equations (2.26) and (2.27), containing derivatives, must be

    approximated by differences. The first differences of the image g in the vertical direction (for

fixed i) and in the horizontal direction (for fixed j) are given by

Δ_i g(i, j) = g(i, j) - g(i - n, j),   Δ_j g(i, j) = g(i, j) - g(i, j - n)

where n is a small integer, usually 1. The value n should be chosen small enough to provide a good approximation to the derivative, but large enough to neglect unimportant changes in the image function. Symmetric expressions for the differences,

Δ_i g(i, j) = g(i + n, j) - g(i - n, j),   Δ_j g(i, j) = g(i, j + n) - g(i, j - n),


are usually not used because they neglect the impact of the pixel itself.

2.4 Segmentation

    Segmentation refers to the process of partitioning a digital image into multiple

    segments (sets of pixels, also known as super pixels). The goal of segmentation is to simplify

    and/or change the representation of an image into something that is more meaningful and

    easier to analyze. More precisely, image segmentation is the process of assigning a label to

    every pixel in an image such that pixels with the same label share certain visual

    characteristics. Segmentation are divided into three groups which are thresholding, edge-

    based segmentation and region-based segmentation they discussed in detail.

    2.4.1 Thresholding

    Gray-level thresholding is the simplest segmentation process. Many objects or image

    regions are characterized by constant reflectivity or light absorption of their surfaces; a

    brightness constant or threshold can be determined to segment objects and background.

Thresholding is computationally inexpensive and fast; it is the oldest segmentation method and is still widely used in simple applications, and it can easily be done in real time using specialized hardware (Sonka et al., 2008).

A complete segmentation of an image R is a finite set of regions R_1, ..., R_S such that R = R_1 ∪ ... ∪ R_S and R_i ∩ R_j = ∅ for i ≠ j. Complete segmentation can result from thresholding in simple scenes. Thresholding is the transformation of an input image f to an output (segmented) binary image g as follows:


g(i, j) = 1 for f(i, j) ≥ T,   g(i, j) = 0 for f(i, j) < T                         (2.32)

where T is the threshold, g(i, j) = 1 for image elements of objects, and g(i, j) = 0 for image elements of the background (or vice versa). If objects do not touch each other, and if their gray-levels are clearly distinct from the background gray-levels, thresholding is a suitable segmentation method. A global threshold is determined from the whole image f, T = T(f). Local thresholds, on the other hand, are position dependent, T = T(f, f_c), where f_c is that part of the image f in which the threshold is determined. One option is to divide the image f into sub-images f_c and determine a threshold independently in each sub-image; if a threshold cannot be determined in some sub-image, it can be interpolated from the thresholds determined in neighboring sub-images. Each sub-image is then processed with respect to its local threshold.
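A minimal Python/NumPy sketch of basic thresholding (equation 2.32) and of band thresholding is given below; the threshold values are illustrative.

```python
# Basic gray-level thresholding (equation 2.32) and a simple band-thresholding
# variant. Threshold values are illustrative.
import numpy as np

def threshold(image, T=128):
    """g(i,j) = 1 for f(i,j) >= T, 0 otherwise."""
    return (image >= T).astype(np.uint8)

def band_threshold(image, low=80, high=150):
    """Segment pixels whose gray-level lies in the set D = [low, high]."""
    return ((image >= low) & (image <= high)).astype(np.uint8)
```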

Basic thresholding as defined by equation 2.32 has many modifications. One possibility is to segment an image into regions of pixels with gray-levels from a set D and into the background otherwise (band thresholding):

g(i, j) = 1 for f(i, j) ∈ D,   g(i, j) = 0 otherwise                               (2.36)

This thresholding can be useful, for instance, in microscopic blood cell segmentation, where a particular gray-level interval represents the cytoplasm, the background is lighter, and the cell kernel is darker. This thresholding definition can serve as a border detector as well, assuming dark objects on a light background. If the gray-level set D is chosen to contain


just these border gray-levels, and if thresholding according to equation 2.36 is used, the object borders can be detected. There

    are many modifications that use multiple thresholds, after which the resulting image is no

    longer binary, but rather an image consisting of a very limited set of gray-levels.

g(i, j) = 1 for f(i, j) ∈ D_1,   g(i, j) = 2 for f(i, j) ∈ D_2,   ...,   g(i, j) = n for f(i, j) ∈ D_n,   g(i, j) = 0 otherwise     (2.37)

where each D_i is a specified subset of gray-levels.

    Another special choice of gray-level subset Di defines semi-thresholding, which is

    sometimes used to make human-assisted analysis easier:

g(i, j) = f(i, j) for f(i, j) ≥ T,   g(i, j) = 0 for f(i, j) < T                   (2.38)

    This process aims to mask out the image background, leaving gray-level information present

    in the objects. Thresholding has been presented relying only on gray-level image properties.

Note that this is just one of many possibilities; thresholding can also be applied when the pixel values do not represent brightness but instead represent gradient, a local texture property, or the value of any other image decomposition criterion.


    2.4.2 Edge-based segmentation

    Edge-based segmentation represents a large group of methods based on information

    about edges in the images; it is one of the earliest segmentation approaches and still remains

very important. Edge-based segmentations rely on the edges found in an image by edge-detecting operators; these edges mark image locations of discontinuities in gray-level, color, texture, etc.

There are several edge-based segmentation methods, which differ in the strategies leading to final border construction and also in the amount of prior information that can be incorporated into the method. The more prior information that is available to the segmentation process, the better the segmentation results that can be obtained. Prior information affects segmentation algorithms: if a large amount of prior information about the desired result is available, the boundary shape and its relations with other image structures are specified very strictly and the segmentation must satisfy all these specifications. If little information about the boundary is known, the segmentation method must take more local information about the image into consideration and combine it with specific knowledge that is general for an application area.

    If little prior information is available, it cannot be used to evaluate the confidence of

    segmentation results, and therefore no basis for feedback corrections of segmentation is

    available (Sonka et al., 2008).

    The most common problems of edge-based segmentation, caused by noise or

    unsuitable information in an image, are an edge presence in locations where there is no

    border, and no edge presence where a real border exists. Clearly both cases have a negative

    influence on segmentation results.


    2.4.2.1 Edge image thresholding

Almost no zero-valued pixels are present in an edge image; however, small edge values correspond to non-significant gray-level changes resulting from quantization noise, small lighting irregularities, etc. Simple thresholding of an edge image can be applied to remove these values. This approach is based on an image of edge magnitudes processed by an appropriate threshold. Selection of an appropriate global threshold is often difficult and sometimes impossible; p-tile thresholding can be applied to define a threshold, and a more exact approach using orthogonal basis functions has been described, which gives good results if the original data has good contrast and is not noisy.

    2.4.2.2 Edge relaxation

Borders resulting from the previous method are strongly affected by image noise, often with important parts missing. Considering edge properties in the context of their mutual neighbors can increase the quality of the resulting image. All image properties, including those of further edge existence, are iteratively evaluated with more precision until the edge context is totally clear: based on the strength of the edges in a specified local neighborhood, the confidence of each edge is either increased or decreased.

A weak edge positioned between two strong edges is an example of context; it is highly

    probable that this inter-positioned weak edge should be part of a resulting boundary. If, on

    the other hand, an edge (even a strong one) is positioned by itself with no supporting context,

    it is probably not a part of any border.


    2.4.3 Region-based segmentation

    Region growing techniques are generally better in noisy images, where borders are

    extremely difficult to detect. Homogeneity is an important property of regions and is used as

    the main segmentation criterion in region growing, whose basic idea is to divide an image

    into zones of maximum homogeneity. The criteria for homogeneity can be based on gray-

level, color, texture, shape, model (using semantic information), etc. The properties chosen to

    describe regions influence the form, complexity, and amount of prior information in the

    specific region-growing segmentation method.

    Region growing segmentation must satisfy the following condition of complete

segmentation:

R = R_1 ∪ R_2 ∪ ... ∪ R_S,   R_i ∩ R_j = ∅ (i ≠ j),

H(R_i) = TRUE for i = 1, 2, ..., S,

H(R_i ∪ R_j) = FALSE for i ≠ j, R_i adjacent to R_j,

where S is the total number of regions in the image and H(R_i) is a binary homogeneity evaluation of the region R_i. The resulting regions of the segmented image must be both homogeneous and maximal, where by 'maximal' we mean that the homogeneity criterion would no longer hold after merging a region with any adjacent region. The homogeneity criterion may use the average gray-level of the region, its color properties, or an m-dimensional vector of average gray values for multi-spectral images.
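A minimal sketch of seeded region growing in Python/NumPy is given below; the seed point, the 4-connectivity and the gray-level tolerance used as the homogeneity criterion are illustrative assumptions.

```python
# Seeded region growing: starting from a seed pixel, 4-connected neighbours are
# merged while the homogeneity criterion (gray-level close to the running region
# mean) holds. Seed and tolerance are illustrative.
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance=10.0):
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    region[seed] = True
    total, count = float(image[seed]), 1
    while queue:
        i, j = queue.popleft()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and not region[ni, nj]:
                if abs(float(image[ni, nj]) - total / count) <= tolerance:
                    region[ni, nj] = True
                    total += float(image[ni, nj])
                    count += 1
                    queue.append((ni, nj))
    return region
```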

    2.5 Image Analysis/Classification/Interpretation

For some applications the features, as extracted from the image, are all that is required. Most of the time, however, one more step must be taken: classification or interpretation. The most important interpretation method is conversion of units. Rarely will dimensions in pixels or gray-levels be appropriate for an industrial application. As part of the software, a


calibration procedure will define the conversion factors between vision system units and real-world units (Nello, 2000).

Reference points and other important quantities are occasionally not visible on the part, but must be derived from measurable features. For instance, a reference point may be defined by the axes of the tubes on either side of a bend. Error checking, or image verification, is a vital process. By closely examining the features found, or by extracting additional features, the image itself is tested to verify that it is suited to the processing being done. Since features are being checked, this can be considered a classification or interpretation step. Without it, features could have incorrect values because the part is mislocated, upside down or missing, because a light has burned out, because the lens is dirty, etc. A philosophy of fail-safe programming should be adopted; that is, any uncertainty about the validity of the image or the processing should either reject the part or shut down the process. This is

imperative in process control, process verification, and robot guidance, where safety is at

    risk. Unfortunately, error checking procedures are usually specific to a certain type of image;

    general procedures are not available.

    2.6 Decision Making

Decision making, in conjunction with classification and interpretation, is characterized as heuristic, decision-theoretic, syntactic or edge tracking. The most commonly used decision techniques are discussed below.

    2.6.1 Heuristic

In this case, the basis of the machine vision decision emulates how humans might characterize the image, using measures such as the intensity histogram, black-white/white-black transition counts, pixel counts, background/foreground pixel maps, average intensity values, delta or normalized image intensity pixel maps, a number of data points each representing the integration over some area in the picture, and row/column totals.

Systems are often designed to handle decision making within a specific duration of time. For example, some companies implement these programs in hardware and, consequently, can handle decision making at rates as high as 3000 decisions per minute. These systems typically operate with a train-by-showing technique. During training (sometimes called learning), a range of acceptable representative samples is shown to the system, and the representation which is to serve as a standard is established. The representation may be based on a single object, or on the average of the images from many objects, or it may include a family of known good samples, each creating a representation standard to reflect the acceptable variations.

In operating mode, decision-making is based on how closely the representation of the object presently being examined compares to the original or standard representation(s). A goodness-of-fit criterion is established during training to reflect the range of acceptable appearances the system should be tolerant of. If the difference between the representation established from the object under test and the standard exceeds the goodness-of-fit criterion, it is

    considered a reject. Significantly, the decision may be based on a combination of criteria

    (pixel counts and transition count, for example). The goodness-of-fit criteria then become

    based on statistical analysis of the combination of each of the fit criteria.

    Decision-making, in conjunction with these approaches, can be either deterministic or

    probabilistic. Deterministic means that given some state or set of conditions, the outcome of

    a function or process is fully determined with 100% probability of the same outcome.

Probabilistic means that a particular outcome has some probability of occurrence of less than 100%, given some initial set of conditions.
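A minimal Python/NumPy sketch of the train-by-showing, goodness-of-fit decision described above is given below; the feature vectors and the tolerance are illustrative assumptions.

```python
# Train-by-showing / goodness-of-fit decision: a standard feature vector is
# learned from known-good samples, and a part is rejected when its features
# deviate from the standard by more than the tolerance set during training.
import numpy as np

def train_standard(good_samples):
    """good_samples: (n, d) array of feature vectors (e.g. pixel count,
    transition count, average intensity) from acceptable parts."""
    samples = np.asarray(good_samples, dtype=np.float64)
    return samples.mean(axis=0), samples.std(axis=0) + 1e-9

def accept(features, standard_mean, standard_std, tolerance=3.0):
    """Accept if every feature lies within `tolerance` standard deviations."""
    deviation = np.abs(np.asarray(features) - standard_mean) / standard_std
    return bool(np.all(deviation <= tolerance))
```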


    2.6.2 Syntactic Analysis

The ability to make decisions based on pieces of an object is usually based upon syntactic analysis, unlike the decision-theoretic approach. In this case, the object is represented as a string, a tree, or a graph of pattern primitive relationships, and decision making is based on a parsing procedure. Another way to view this is as local feature analysis (LFA) - a collection of local features with specified spatial relationships between their various combinations. Again, these primitives can be derived from binary or grayscale images, either thresholded or edge processed.

For example, three types of primitive - curve, angle and line - can together be used to describe a region. Image analysis involves decomposing the object into its primitives, and the relationships between the primitives result in recognition. The decision making on primitives can be performed using decision-theoretic or statistical techniques.

    2.6.3 Edge tracking

    In addition to geometric feature extraction of boundary images, image analysis can be

conducted by edge tracking: when an edge is detected, it is stored as a linked chain of edge points. Alternatively, line encoding and connectivity analysis can be conducted; that is, the locations of the detected edge points are stored and line fitting is performed (Zuechi, 2000).

Decision making is then based on comparison of the line segments directly, or on probability theory. Line segment descriptions of objects are called structural descriptions, and the process of comparing them to models is called structural pattern recognition.


    2.7 Related Works

A thorough survey of work related to eye-tracking techniques and eye-blink detection systems is presented below.

    2.7.1 Eye tracking technique

Tian et al. (2000a) present a dual-state eye model with two templates - one for the closed and one for the open eye. The template for the open eye consists of a circle and two parabolic

    arcs. The circle, described by three parameters x0, y0, r ((x0, y0) the centre and r the

    radius), represents the iris. The arcs represent the eyelids. They are described by 3 points

    one for each eye corner and one on the apex of the eyelids. The template for a closed eye is a

    straight line between the eye corners. If the iris is detected, the eye is open and modeled with

    a template for an open eye, otherwise it is closed. They assume that the eye features are

    given on the first frame. The inner corners are tracked by minimizing the squared difference

    between the intensity values regions close to the corners in two subsequent frames. The outer

    corners are detected as first, lying on the line between the two inner corners and second stay

    apart from them in width w (certain value obtained on the first frame). After botheye corners

    are fixed, to complete the eye tracking, the eyelids have to be localized. It is done by tracking

    central points on both eyelids by minimizing the squared difference between the intensity

    values. They tested their method over 500 image sequences where the full-size face takes

    220x300 pixels and each eye region - 60x30. The method works robustly and accurately

    across race and expressions variety and make-up presence.

    Tian et al. (2000b) have developed a system for recognizing three action units - completely closed, narrow-open and completely open eye - by use of Gabor wavelets in nearly frontal image sequences. The feature points are three: the inner corner, the outer corner and the middle

  • 8/6/2019 Main body 1 2

    44/93

    44

    point between the first two. The eye corners are tracked through the whole sequence. The most important of them for tracking is the inner corner; the positions of the others are found relative to the inner corner position. The initial positions of the feature points are given. Each point is then tracked by minimizing a function of the intensity (grayscale) values over a certain displacement. The outer corners are detected by using the size of the eyes, obtained on the first frame. The middle point is the point midway between the inner and outer corner. For each of these three feature points, a set of multi-scale and multi-orientation Gabor coefficients is calculated: three spatial frequencies and six orientations, starting from 0 and differing by π/6, are used. These 18 coefficients are fed into a neural network to determine the state of the eye. Unfortunately, only the success of detecting the eye state (not of the eye-corner tracking) is reported, since this is the main aim of the paper. The recognition rate when three action units are recognized is 89%, and when only two are distinguished (treating the narrow eye as closed) it increases to 93%.

    Sirohey et al. (2002) present a flow-based method for tracking. Their method for detection is based on finding the combination of edge segments that best represents the upper eyelid. First, the head motion is detected from the edge segments associated with the silhouette of the head, and on the basis of this information the head is stabilized. The head motion vectors are subtracted from the iris and eyelid motion so that only their independent motions remain. The eyelids are tracked as follows: the edge pixels of the eyelid that have flow vectors associated with them are followed according to the direction and magnitude of the flow vector. If edge pixels are found in the close neighborhood of the pointed-to pixels, they are labeled as possibly belonging to the eyelid. The candidates are fitted to a third-order polynomial.

  • 8/6/2019 Main body 1 2

    45/93

    45

    With this method the iris is found correctly at each frame and the eyelids in 90% of the frames (two sequences of 120 frames of a single person, with and without glasses). The paper does not mention how the lower eyelid is modeled, extracted and tracked. Blinking is detected from the height of the apex of the upper eyelid above the iris centre.

    Black et al. (1996) explore a template-based approach combined with parametric optical flow (for example affine), in which they represent rigid and deformable facial motions using piecewise parametric models of image motion. The facial features - face, eye regions, eyebrows and mouth - are given. The face is broken down into parts and the motion of each of them is modeled independently by planar motion models. The affine model is sufficient to model the eye motion:

    u(x, y) = a0 + a1·x + a2·y,    v(x, y) = a3 + a4·x + a5·y,

    where u and v are the horizontal and vertical components of the flow at image point p(x, y). The coordinates are defined with respect to some image point (typically the centre of the region).

    The difference between the current image and the previous image warped by the motion parameters is minimized by a simple gradient-descent scheme. The eye state transition can be described by three parameters - vertical translation, divergence (isotropic expansion) and rapid deformation (squashing and stretching) - whose interpretation is given in Table 2.1.

    The curves of all three are plotted against time and observed for local maxima and minima. The changes in the three functions have to appear at nearly the same time. An eye blink is detected when the translation reaches a maximum, the divergence a minimum and the deformation a maximum. The reported accuracy of 88% for artificial sequences and 73% for TV movies is measured over all facial expressions. Unfortunately, the achieved processing time is 2 min/frame, which is not suitable for real-time applications.

  • 8/6/2019 Main body 1 2

    46/93

    46

    Table 2.1: Parameters describing the movement in the eye region (source: Black et al., 1996)

  • 8/6/2019 Main body 1 2

    47/93

    47

    Cohn et al. (2004) and Moriayama et al. (2004) present different aspects of the same system, in which a carefully detailed generative eye model (see Fig. 2.10) is used. A template is built using two types of parameters: structure and motion. The structure parameters describe the appearance of the eye region, capturing its racial, individual and age variations. This includes the size and color of the iris, the sclera, the dark regions near the left and right corners, the eyelids, the width and boldness of the double-fold eyelid, the width of the bulge below the eye, and the width of the illumination reflection on the bulge and the furrow below the bulge. The motion parameters describe the changes over time. The movement of the iris is described by the 2D position of its centre, and closing and opening of the eye is represented by the height of the eyelids. The skew of the upper eyelid is also a motion parameter, capturing the change of the upper eyelid when the eyeball moves. Unfortunately, the structure parameters are not obtained automatically: the model is individualized by manually adjusting the structural parameters. From this initialization they derive the structural parameters, which remain fixed for the entire sequence. The features are then tracked by iterative minimization of the mean square error between the input image and the template generated by the current motion parameters.

    According to Cohn et al. (2000), the model for tracking the eye features and blink detection is part of a system for automatic recognition of embarrassed smiles. They tested the hypothesis that there is a correlation between head movement, eye gaze and lip displacement during embarrassed smiles; this is probably why the accuracy of the tracking method itself is not measured or reported. The second paper reports failure in only 2 of 576 image sequences, caused by the head tracker. The database includes a variety of subjects of different ethnic groups, ages and genders, with in-plane and limited out-of-plane motion.

  • 8/6/2019 Main body 1 2

    48/93

    48

    Fig. 2.10: Detailed eye template used (Moriayama et al., 2004; Cohn et al., 2004)

  • 8/6/2019 Main body 1 2

    49/93

    49

    An active contour technique is applied by Paradas (2000) to track the eyelids. The model for the eye consists of two curves: one for the lower eyelid, with one minimum, and one for the upper eyelid, with one maximum. Tracking of the eyelids is done with an active contour technique in which the motion is embedded in the energy-minimization process of the snakes.

    A closed snake, which tracks the eyelids, is built by selecting a small percentage of the pixels along the contours obtained during initialization or tracked on the previous frame; among these points are the eye corners. Motion-compensation errors are computed for each snaxel (x0, y0) within a given range of allowed displacements (dx, dy). The pixels (x0+dx, y0+dy) that produce the smallest compensation error are selected as candidates for the snaxel (x0, y0) in the current frame, and a two-step dynamic-programming algorithm is run over these candidates.

    The paper does not report the running time of the algorithm. The author only mentions that it is stable against blinking, head translation and rotation, up to the extent where the eyes remain visible.

    2.7.2 Blink detection systems

    Very briefly, the ways in which the authors of the reviewed papers detect blinking are summarized below.

    In Tian et al. (2000), blinking is detected when the iris is not visible. This is not the most appropriate criterion: even if the iris-detection method itself never fails, false alarms and misclassifications might occur because the eye or iris is occluded by head rotation.

    Sirohey et al. (2002) detect a blink from the height of the apex of the upper eyelid above the iris centre, which is probably a consequence of not tracking the lower eyelid.

  • 8/6/2019 Main body 1 2

    50/93

    50

    The extension of Tian's approach (2000) is a paper by Cohn et al. (2002). It focuses on blink detection rather than on locating the eye features. The eye region is defined on the first frame by manually picking four points: the two eye corners, the centre point of the upper eyelid and a point straight under it. It stays the same over the whole image sequence, because the face region is stabilized. The eye region is divided into two portions, upper and lower, by the line connecting the eye corners. Blink detection relies on the fact that the intensity distributions of the upper and lower parts change as the eye opens and closes. The upper part consists of sclera, pupil, eyelash, iris and skin, of which only the sclera and skin contribute to increasing the average intensity values. When the upper eyelid closes, the eyelash moves into the lower region and the pupil and iris are replaced by brighter skin, which increases the average intensity of the upper portion and simultaneously decreases that of the lower. The average greyscale intensities of both portions are plotted against time; the eye is closed when the upper curve reaches a maximum. A blink is also distinguished from eyelid flutter by counting the number of crossings and peaks: while a blink is under way there is only one peak between two neighbouring crossings, otherwise there is more than one.

    Correlation with a template of the person's eye is used in the paper by Grauman (2001) to classify the state of the eye. The difference image during the first several blinks is used to detect the eye regions. Candidates are discarded based on anthropometric measures: the distances between the blobs and their widths and heights should keep certain ratios, among other constraints. The remaining candidate pairs are classified by the Mahalanobis distance between their parameter vector and the mean blink-pair property vector. The bounding box of the detected eye region determines the template. Further, eye blinking is decided by

  • 8/6/2019 Main body 1 2

    51/93

    51

    calculating the correlation between this template and the image in the current frame. As the eye closes, it begins to look less like the template eye; as it reopens, it becomes more and more similar. A correlation score between 0.85 and 1 classifies the eye as open, a score between 0.55 and 0.8 as closed, and below 0.4 the tracker is considered lost. Again, the technique is appropriate only for blink detection, not for precise eye-feature extraction and tracking. The reported overall detection accuracy is 95.6% at an average of 28 frames per second. This result might change for longer image sequences, such as those used for driver-drowsiness detection, because the template of a single person expires over time as the person gets tired.

    Ramadan et al. (2002) used an active deformable model technique to track the iris. A statistical pressure snake, in which the internal forces are eliminated, tracks the pupil: the snake expands and encloses the pupil. When the upper eyelid occludes the pupil, i.e. during a blink, the snake collapses, and the duration of the snake collapse is the measure of the blink. After the eye reopens, the snake can expand again if the iris position has not changed during the blink; otherwise its position has to be re-initialized manually. Although very high tracking accuracy is reported, the system suffers from several disadvantages. The main problem is the manual initialization and re-initialization. Furthermore, the way of measuring blinking does not seem very reliable: the snake might collapse during saccades, which would be mistaken for blinks. The third problem is the position of the camera, which is attached to the head; this restricts head movements and also makes the equipment inapplicable for drivers.

    Danisman et al. (2010) presented an automatic drowsy-driver monitoring and accident prevention system based on monitoring changes in the eye-blink duration. The proposed method detects visual changes in eye locations using the proposed horizontal

  • 8/6/2019 Main body 1 2

    52/93

    52

    symmetry feature of the eyes. This method detects eye blinks via a standard webcam in real time at 110 fps for a 320x240 resolution. Experimental results on the ZJU eye-blink database showed that the proposed system detects eye blinks with 94% accuracy and a 1% false-positive rate.

  • 8/6/2019 Main body 1 2

    53/93

    53

    CHAPTER THREE

    ALGORITHM DEVELOPMENT

    3.1 System Flowchart

    The flowchart in Fig. 3.1 describes the processes involved in the drowsiness detection performed by the system. The image is acquired with the aid of a digital camera and converted to grayscale. The search for the location of the eye is initialized by analyzing the involuntary blinks of the user of the system; this is achieved by a motion-analysis technique. An online template of the eye is created and used to update the position of the eye every thirty seconds, to compensate for slight movements of the driver's head. Whenever the output of the tracker is lost, the system re-initializes itself by automatically repeating the process. Once tracking is successful, the system proceeds to extract the visual cue of the driver's eye by detecting the blinks produced. This information is used to decide when to trigger the alarm. In other words, drowsiness detection keeps track of the number of blinks produced by the user; when this reaches a critical point, which translates into detecting a short period of micro-sleep, the alarm is triggered. If the alarm is not triggered within five minutes, the system resets itself automatically.

    3.2 Software Development

    The algorithm was developed in the C language in the Visual Studio environment, linked with the OpenCV library, which is widely used for image processing and computer vision. The algorithm is broken down into five processes, namely: eye detection, template creation, eye tracking, blink detection and drowsiness detection.

  • 8/6/2019 Main body 1 2

    54/93

    54

    Fig. 3.1: Flowchart of the eye-blink detection system (stages: image acquisition, eye detection, eye tracking, blink detection, drowsiness detection and alarm activation, with yes/no branches for tracking success and detected drowsiness)

  • 8/6/2019 Main body 1 2

    55/93

    55

    A number of significant contributions and advancements have been made to the work of Grauman et al. (2001) in order to improve the accuracy and reliability of the system; these are discussed in detail below.

    3.3 Eye-Detection

    In this stage the system tries to locate the position of the eye by analyzing the blinking of the user. This is achieved by creating a difference image from the current frame and the previous gray-scaled frame of the driver; the difference image then undergoes binarization. Binarization is the conversion of a gray-scaled image to a binary image, which is often used to show regions of significant movement in the scene. A binary image is an image in which each pixel assumes one of two discrete values, in this case 0 and 1, with 0 representing black and 1 representing white after thresholding.
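    A minimal sketch of this frame-differencing and thresholding step is given below, assuming the legacy OpenCV C interface used elsewhere in this work; the function name make_motion_mask and the threshold value of 15 are illustrative choices, not the exact values of the implemented system.

    /* Sketch: absolute frame difference followed by binarization (OpenCV 1.x C API). */
    #include <opencv/cv.h>

    void make_motion_mask(IplImage *prev_gray, IplImage *curr_gray, IplImage *diff_bin)
    {
        /* Absolute difference between consecutive gray-scale frames:
           large values appear where something (e.g. an eyelid) moved. */
        cvAbsDiff(curr_gray, prev_gray, diff_bin);

        /* Binarization: pixels whose change exceeds the (assumed) threshold of 15
           become 255 (logical 1, white); all other pixels become 0 (black). */
        cvThreshold(diff_bin, diff_bin, 15, 255, CV_THRESH_BINARY);
    }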

    The next phase in this stage is to eliminate noise, which is often caused by naturally occurring jitter due to lighting conditions and camera resolution. We employ functions in the OpenCV library that provide a fast, convenient interface for morphological transformations on images, namely dilation and erosion. They remove noise and produce fewer, larger connected components. A 3x3 star-shaped convolution kernel is passed over the binary image in an opening morphological operation. Listing 1 in Appendix A shows the algorithm for the opening morphological operation.
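    The sketch below illustrates the idea of the opening operation with a 3x3 cross-shaped kernel in the OpenCV C interface; Listing 1 in Appendix A remains the authoritative version, and the helper name open_binary_image is hypothetical.

    /* Sketch: opening (erosion followed by dilation) with a 3x3 cross-shaped kernel. */
    #include <opencv/cv.h>

    void open_binary_image(IplImage *diff_bin)
    {
        /* 3x3 star/cross-shaped structuring element, anchored at its centre. */
        IplConvKernel *kernel =
            cvCreateStructuringElementEx(3, 3, 1, 1, CV_SHAPE_CROSS, NULL);

        /* Erosion removes isolated noise pixels; dilation restores the size of the
           surviving blobs, leaving fewer and larger connected components. */
        cvErode(diff_bin, diff_bin, kernel, 1);
        cvDilate(diff_bin, diff_bin, kernel, 1);

        cvReleaseStructuringElement(&kernel);
    }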

    Candidate eye-blobs are extracted by recursive labeling of the connected components of the resulting binary image. The system then determines whether a pair of connected components is a possible match for the user's eyes. The algorithm for connected-component labeling is shown in Listing 2 in Appendix A.

  • 8/6/2019 Main body 1 2

    56/93

    56

    A number of experimentally derived heuristics are applied, based on the width, height, vertical distance and horizontal distance of the components, to pinpoint the pair that most likely represents the driver's eyes (Chau et al., 2005). The system proceeds only if the number of connected components is two; otherwise the process re-initializes itself. The heuristics are expressed as a set of rules applied by a series of filters: the widths of the components must be about the same, the heights of the connected components must be about the same, and the vertical distance between them must be small (a sketch of such a filter is given at the end of this section). If a pair of components passes through this set of filters, there is a good indication that the driver's eyes have been successfully located. This technique is known as motion analysis.

    Connected component labeling is applied next to obtain the number of connected

    components in the difference image. Fig. 3.2 shows the thresholded difference image prior to

    erosion.
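    The following sketch shows how the filtering rules above might be expressed in C; the Blob structure and all tolerance values (5 pixels, and the 2x-6x width range for the horizontal distance) are hypothetical examples rather than the exact thresholds of Chau et al. (2005) or of this system.

    /* Sketch: heuristic filter deciding whether two blobs form a plausible eye pair. */
    #include <stdlib.h>

    typedef struct { int x, y, width, height; } Blob;  /* bounding box of a component */

    int is_probable_eye_pair(Blob a, Blob b)
    {
        /* Widths and heights of the two components must be about the same. */
        if (abs(a.width  - b.width)  > 5) return 0;
        if (abs(a.height - b.height) > 5) return 0;

        /* Vertical distance between the components must be small... */
        if (abs(a.y - b.y) > 5) return 0;

        /* ...while the horizontal distance must be plausible for a pair of eyes. */
        int dx = abs(a.x - b.x);
        if (dx < 2 * a.width || dx > 6 * a.width) return 0;

        return 1;  /* the pair most likely represents the driver's eyes */
    }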

    3.4 Template Creation

    After the connected components have successfully passed through the filter, the larger of the two components is chosen for template creation, because the size of the template to be created is directly proportional to the chosen component. The larger the chosen component, the more brightness information it contains, which results in more accurate tracking. The system therefore obtains the boundary of the selected component, which is used to extract a portion of the current frame as the eye template. Since we need an open-eye template, it would be a mistake to create the template the moment the eye is located, because blinking involves closing and opening of the eye.

  • 8/6/2019 Main body 1 2

    57/93

    57

    Fig. 3.2: Transition during the eye detection using the motion analysis technique

  • 8/6/2019 Main body 1 2

    58/93

    58

    Once the eye is located, we therefore set some delay before creating the template: since the user's eyes are still closed at the heuristic filtering stage above, the system waits a moment for the user to open his eyes. Listing 4 in Appendix B shows the algorithm used to create the online template of the eye.
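    A minimal sketch of the template-extraction step, assuming the OpenCV C interface, is given below; the function name create_eye_template is hypothetical, and eye_box stands for the bounding box of the larger connected component.

    /* Sketch: copy the open-eye region of the current frame into a new template image. */
    #include <opencv/cv.h>

    IplImage *create_eye_template(IplImage *gray_frame, CvRect eye_box)
    {
        IplImage *templ = cvCreateImage(cvSize(eye_box.width, eye_box.height),
                                        gray_frame->depth, gray_frame->nChannels);

        /* Copy only the region of the current frame that contains the open eye. */
        cvSetImageROI(gray_frame, eye_box);
        cvCopy(gray_frame, templ, NULL);
        cvResetImageROI(gray_frame);

        return templ;  /* used later by the template-matching tracker */
    }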

    3.5 Eye-Tracking

    Eye detection alone is not sufficient to give the highly accurate blink information desired, since the head may move from time to time. A fast tracking procedure is needed to maintain exact knowledge of the eye's appearance. Having the eye template and the live video feed from the camera makes it possible for the system to locate the user's eye in the subsequent frames using template matching. The search is limited to a small search window, since searching the whole image would use an extensive amount of CPU resources.

    The system utilizes the square difference matching method, which matches the

    squared difference so that a perfect match will be zero and bad matches will be large (Gray,

    2008). The equation is given by:

    R(x, y) = Σ over (x', y') of [ T(x', y') − I(x + x', y + y') ]²                (3.1)

    where T(x', y') and I(x + x', y + y') are the brightness of the pixels at (x', y') in the template and at (x + x', y + y') in the source image respectively; in the normalised form of the measure, the score is scaled using the average value of the pixels in the template raster and the average value of the pixels in the current search window of the image. Any time the squared difference exceeds a predefined threshold, the tracker is believed to be lost; in this event it is critical that the tracker declares itself lost and re-initializes by going back to eye detection by the

  • 8/6/2019 Main body 1 2

    59/93

    59

    motion analysis technique. Fig. 3.3 shows sample frames of the tracked object. Listing 5 in Appendix A shows the algorithm for locating the eye in subsequent frames; the location of the best match is available in minloc, which is used to draw a rectangle in the displayed frame to label the object being tracked, as shown in Fig. 3.3.
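    The sketch below shows one tracking step with normalized squared-difference matching in the OpenCV C interface; the lost-tracker threshold of 0.2 and the function name track_eye are assumptions for illustration, not the exact values of Listing 5.

    /* Sketch: one template-matching step over the search window (OpenCV 1.x C API). */
    #include <opencv/cv.h>

    /* Returns 1 on success and writes the best-match location into *minloc;
       returns 0 if the tracker should declare itself lost. */
    int track_eye(IplImage *search_win, IplImage *templ, CvPoint *minloc)
    {
        CvSize res_size = cvSize(search_win->width  - templ->width  + 1,
                                 search_win->height - templ->height + 1);
        IplImage *result = cvCreateImage(res_size, IPL_DEPTH_32F, 1);

        /* Equation (3.1): a perfect match gives 0, bad matches give large values. */
        cvMatchTemplate(search_win, templ, result, CV_TM_SQDIFF_NORMED);

        double min_val, max_val;
        CvPoint max_loc;
        cvMinMaxLoc(result, &min_val, &max_val, minloc, &max_loc, NULL);
        cvReleaseImage(&result);

        /* If even the best match is poor, go back to eye detection by motion analysis. */
        return (min_val < 0.2) ? 1 : 0;
    }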

    3.6 Blink Detection

    A human being must periodically blink to keep his eyes moist. Blinking is

    involuntary and fast. Most people do not notice when they blink. However, detecting a

    blinking pattern in an image sequence is an easy and reliable means to detect the presence of

    a face. Blinking provides a space-time signal which is easily detected and unique to faces.


    In the algorithms developed in previous works (Grauman et al., 2001; Chau et al., 2005), eye blinks were detected by observing correlation scores, so that the detection of blinking and the analysis of blink duration are based solely on the correlation scores generated in the tracking step using the online template of the user's eye. In other words, as the user's eye closes during a blink, its similarity to the open-eye template decreases. While the user's eye is in the normal open state, very high correlation scores of about 0.85 to 1.0 are reported; as the user blinks, the scores fall to values of about 0.5 to 0.55.

    In theory, when the user blinks, the similarity to the open-eye template decreases. While this is true in most cases, we found that it is only reliable if the user does not make any significant head movements: if the user moves his head, the correlation score also decreases even if the user does not blink.

  • 8/6/2019 Main body 1 2

    60/93

    60

    Fig.3.3: Sample frames of the tracked object

  • 8/6/2019 Main body 1 2

    61/93

    61

    In this system we therefore use motion analysis to detect eye blinks, just as in the very first stage above; only this time the detection is limited to a small search window, the same window that is used to locate the user's eye. Listing 6 in Appendix A shows the algorithm for blink detection using motion analysis.

    From Listing 6 in Appendix A, cvFindContours returns the connected components in comp and the number of connected components nc. To determine whether a motion is an eye blink or not, we apply two rules to the connected components: there must be only one (1) connected component, and that component must be located at the centroid of the user's eye (a sketch of this test is given after the next paragraph).

    Note that we require only one (1) connected component, although a normal blink would yield two (2) connected components; this is because we perform the motion analysis in a small search window that fits only one (1) eye.
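    A hedged sketch of this blink test follows, assuming the OpenCV C interface and a binarized difference image of the search window; the function name is_blink and the 10-pixel centroid tolerance are illustrative assumptions, not the exact values of Listing 6.

    /* Sketch: blink test by motion analysis inside the eye search window. */
    #include <opencv/cv.h>
    #include <stdlib.h>

    int is_blink(IplImage *diff_win_bin, CvPoint eye_centroid)
    {
        CvMemStorage *storage = cvCreateMemStorage(0);
        CvSeq *comp = NULL;

        /* Number of connected components in the binarized difference of the window. */
        int nc = cvFindContours(diff_win_bin, storage, &comp, sizeof(CvContour),
                                CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

        int blink = 0;
        if (nc == 1 && comp != NULL) {
            /* The single component must sit at the centroid of the user's eye. */
            CvRect r = cvBoundingRect(comp, 0);
            int cx = r.x + r.width / 2, cy = r.y + r.height / 2;
            if (abs(cx - eye_centroid.x) < 10 && abs(cy - eye_centroid.y) < 10)
                blink = 1;
        }

        cvReleaseMemStorage(&storage);
        return blink;
    }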

    3.7 Drowsiness Detection

    Driver drowsiness is one specific human error that has been well studied. Studies have shown that immediately prior to accidents, the driver's eye-blinking behavior changes (Thorslund, 2003). The basic parameter used to detect drowsiness is the frequency of blinks; the system detects micro-sleep symptoms in order to diagnose driver fatigue. As driver fatigue increases, the blinks of the driver tend to last longer and drowsiness gradually sets in.

    Mathematically, this relationship can be expressed as f ∝ 1/T, where f is the frequency of blinks and T is the blink duration.

    In other words, since the frequency of blinks is inversely proportional to the blink duration, a highly alert person produces blinks of relatively short duration, whereas as drowsiness sets in the blink duration increases; when the blink duration is high the frequency of blinks is low, and vice versa.

    The system determines the blink rate by counting the number of consecutive frames in which the eye remains closed. The system is designed to trigger a warning signal via an alarm once the early stages of drowsiness are detected.
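    A minimal sketch of this decision logic is shown below; the threshold of 60 consecutive closed-eye frames (about two seconds at 30 fps) is an assumed example value, not the threshold used by the system.

    /* Sketch: micro-sleep decision by counting consecutive closed-eye frames. */
    #define MICROSLEEP_FRAMES 60   /* assumed example threshold (~2 s at 30 fps) */

    static int closed_frames = 0;

    /* Call once per frame with eye_closed = 1 when a closure/blink is detected. */
    int update_drowsiness(int eye_closed)
    {
        if (eye_closed)
            closed_frames++;
        else
            closed_frames = 0;     /* eye reopened: reset the counter */

        return closed_frames >= MICROSLEEP_FRAMES;   /* non-zero => trigger the alarm */
    }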

    3.8 Hardware Consideration

    The system was primarily developed and tested on a Windows Vista PC with an AMD Turion 3 GHz processor and 2 GB of RAM. Video was captured with an Averon webcam based on a colour CMOS image sensor, which captures at 30 frames per second; the system processes the video as grayscale images at 320x240 pixels. The sensor has a resolution of 1.3 megapixels and a signal-to-noise ratio of 48 dB, which enhances the accuracy of the system.

  • 8/6/2019 Main body 1 2

    63/93

    63

    CHAPTER FOUR

    RESULTS AND DISCUSSION

    In order to ascertain the reliability of the system, a performance evaluation was carried out. In addition, a compatibility test showed that the system runs on operating systems such as Windows XP and Windows Vista and performs satisfactorily.

    4.1 Blink Detection Accuracy

    The blink-detection accuracy test was conducted using 10 different test subjects, since a more meaningful measure of the overall accuracy of the system is obtained across a broad range of users. In order to measure the detection accuracy, video sequences were captured of each test subject sitting 60 cm away from the camera. The subjects were asked to blink naturally but frequently and to exhibit mild head movements.

    A total of 500 true blinks from the 10 test subjects were analyzed, with each test subject producing 50 blinks. During this evaluation the system encountered two types of errors: missed-blink errors and false-positive blink errors. A missed blink occurs when the system fails to detect the subject's blink when there actually was one. A false-positive blink occurs when the system detects a blink when none was produced by the test subject.

    Twelve (12) blinks were missed out of 500, giving an initial accuracy of (500 - 12)/500 = 97.6%. Furthermore, 15 false-positive blinks were encountered, a false-positive rate of 15/500 = 3%, reducing the overall accuracy of the system to (500 - 12 - 15)/500 = 94.6%.

  • 8/6/2019 Main body 1 2

    64/93

    64

    Table 4.1 shows the summary of results. From the foregoing, the capture rate of the camera, which is 30 frames/second, was used to produce a blink-detection accuracy of 94.6% with a 3% false-positive error. This result is comparable to the work of Danisman et al. (2010), which employed a camera with a capture rate of 110 frames/second in order to obtain an accuracy of 94.8% and a 1% false-positive error.

  • 8/6/2019 Main body 1 2

    65/93

    65

    Table 4.1: Summary of results

    Total number of blinks analyzed 500

    Total missed blinks 12

    Total false positive blinks 15

    Percentage initial accuracy of the system 97.6%

    Percentage overall accuracy of the system 94.6%

  • 8/6/2019 Main body 1 2

    66/93

    66

    4.2 Eye Tracking Accuracy

    This experiment was conducted by placing the test subject at varying distances from the camera. A time constraint of 30 seconds was placed on the system for the automatic initialization of the tracker, which consists of two small bounding boxes that appear on the image. If the tracker does not appear within 30 seconds, it is considered lost. The test was conducted at distances of 30 cm, 60 cm, 90 cm, 150 cm and 180 cm. A log was kept of the number of times the tracker appeared within thirty seconds, and this was expressed as a percentage of the total number of trials at each distance:

    Tracking accuracy (%) = (number of successful tracker initializations within 30 s / total number of trials) x 100

    Fig. 4.1 shows a plot of the percentage tracking accuracy against distance. From the plot, at a distance of 30 cm the accuracy is 72%, while at a distance of 150 cm it drops to 10%.

    The tracking accuracy of the system enables us to ascertain the sensitivity of the

    system at varying d