
    CHAPTER ONE

    INTRODUCTION

    1.1 Scope of Research

Driver fatigue and the drowsiness associated with it are a significant factor in a large number of vehicle accidents. Recent statistics show that 1,200 deaths and 76,000 injuries can be attributed annually to fatigue-related crashes. This disturbing trend urgently calls for the development of early-warning systems that detect driver drowsiness at the wheel (Haro et al., 2000).

    The development of technologies for detecting or preventing drowsiness at the wheel

    has been a major challenge to the field of accident avoidance systems (Neeta, 2002). Because

    of the hazard that drowsiness presents on the road, methods need to be developed for

    counteracting these effects. The aim of this project is to improve on the development of

    drowsiness detection systems. The focus will be placed on designing a system that will

accurately monitor the open or closed state of the driver's eyes in real time. By monitoring the

    eyes, it is believed that the symptoms of driver fatigue can be detected early enough to avoid a

    car accident. Detection of fatigue involves a sequence of images of a face, and the observation

    of eye movements and blink patterns.

Eye-blink detection plays an important role in human-computer interface (HCI) systems. It can also be used in driver assistance systems. Studies show that eye-blink duration is closely related to a subject's drowsiness (Kojima et al., 2001). The openness of the eyes, as well as the frequency of eye blinks, indicates the person's level of consciousness, which has potential applications in monitoring a driver's vigilance for additional safety control. Eye blinks can also be used as a method of communication for people with severe disabilities, in which blink patterns are interpreted as semiotic messages. This provides an


alternate input modality for controlling a computer: communication by blink pattern. The duration of eye closure determines whether a blink is voluntary or involuntary. Blink patterns are used by interpreting voluntary long blinks according to a predefined semiotic dictionary, while ignoring involuntary short blinks (Black et al., 1997).

Eye-blink detection has attracted considerable research interest from the computer vision community. In the literature, most existing techniques use two separate steps for eye tracking and blink detection. For eye-blink detection systems, three types of dynamic information are involved: the global motion of the eye, the local motion of the eyelids, and eye openness/closure. Once the eyes' locations are estimated by the tracking algorithm, the differences in image appearance between open and closed eyes can be used to find the frames in which the subject's eyes are closed, so that eye blinking can be determined. Template matching is used to track the eyes, and color features are used to determine the openness of the eyes. Detected blinks are then used together with pose and gaze estimates to monitor the driver's alertness; differences in intensity values between the upper and lower parts of the eye are used for openness/closure classification, so that closed-eye frames can be detected. The use of low-level features makes real-time implementation of blink detection systems feasible. However, for videos with large variations, such as typical videos collected from in-car cameras, the acquired images are usually noisy and of low resolution. In such scenarios, simple low-level features, like color and image differences, are not sufficient; temporal information is also used by some researchers for blink detection (Grauman et al., 2003).
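As an illustration of the low-level pipeline described above (template matching for eye localization and upper/lower intensity differences for openness classification), a minimal sketch in Python using OpenCV and NumPy is given below. These libraries, the function names and the threshold value are illustrative assumptions, not the system developed in this thesis.

```python
# Minimal sketch: locate the eye by template matching, then classify open/closed
# from the intensity difference between the upper and lower halves of the eye
# region. The threshold value is illustrative.
import cv2
import numpy as np

def locate_eye(frame_gray, eye_template):
    """Return the top-left corner (x, y) of the best template match."""
    result = cv2.matchTemplate(frame_gray, eye_template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)
    return max_loc

def eye_is_closed(frame_gray, top_left, template_shape, diff_threshold=10.0):
    """Crude openness test: an open eye shows a darker upper half (iris, lashes)
    than the lower half; a closed eye shows little upper/lower difference."""
    h, w = template_shape
    x, y = top_left
    region = frame_gray[y:y + h, x:x + w].astype(np.float32)
    upper, lower = region[: h // 2, :], region[h // 2 :, :]
    return abs(upper.mean() - lower.mean()) < diff_threshold
```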


    1.2 Justification of Work

Eye blink is the physiological activity of rapid closing and opening of the eyelids, an essential function that helps spread tears across, and remove irritants from, the surface of the cornea and conjunctiva (Tsubota, 1998). Although blink speed can vary with factors such as fatigue, emotional stress, behavior category, amount of sleep, eye injury, medication, and disease, researchers report that the spontaneous resting blink rate of a human being is approximately 15 to 30 blinks per minute (Karson, 1983). That is, a person blinks roughly once every 2 to 4 seconds, and a blink lasts on average 250 milliseconds.

Currently a generic camera can easily capture face video at no less than 15 fps (frames per second), i.e. the frame interval is not more than 70 milliseconds. Thus, it is easy for a generic camera to capture two or more frames for each blink when a face looks into the camera. The advantages of the eye-blink-based approach are that it is non-intrusive, it can generally be used without user collaboration, and no extra hardware is required. Eye-blink behavior is the prominent characteristic that distinguishes a live face from a facial photograph in images from a generic camera.
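A quick arithmetic check of the figures quoted above (a 250 ms blink captured at 15 fps), written as a short Python snippet for concreteness:

```python
# At 15 fps the frame interval is about 67 ms, so an average 250 ms blink
# spans several consecutive frames, as claimed above.
fps = 15
blink_duration_s = 0.250          # average blink length quoted above
frame_interval_ms = 1000.0 / fps  # ~66.7 ms, i.e. "not more than 70 ms"
frames_per_blink = blink_duration_s * fps

print(f"frame interval: {frame_interval_ms:.1f} ms")
print(f"frames captured per blink: {frames_per_blink:.2f}")  # about 3.75 frames
```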

    1.3 Requirements

The system for tracking the eyes should be robust, non-intrusive and inexpensive, which is quite a challenge in the computer vision field. Eye tracking currently receives a great deal of attention for applications such as facial expression analysis and driver awareness systems. Very accurate eye trackers that rely on external devices already exist. Most modern eye trackers use contrast to locate the centre of the pupil, use infrared cameras to create a corneal reflection, and triangulate the two to determine the fixation point.


    However, eye tracking setups vary greatly; some are head-mounted, some require the

    head to be stable (for example, with a chin rest), and some automatically track the head as

    well. The eye-tracker described in this thesis is characterized for being a noninvasive eye-

    tracker. This is because we do not need any external devices for tracking the eyes besides the

    web camera, which records the video stream. Moreover, the efficiency of the eye-tracker is

    very important when working with real time communications.

    1.4 Objectives of the Research

The specific objectives of the study are to:

(a) develop an algorithm to identify and track the location of the driver's eye;

(b) develop an algorithm for an eye-blink detection system;

(c) design a system that implements (a) and (b); and

(d) evaluate the performance of the system in (c).

    1.5 Thesis Organisation

The remaining part of this thesis is organized as follows. Chapter Two discusses the major image processing techniques used in the design of related systems and surveys related work. The development of the algorithm is presented in Chapter Three, while the test experiments conducted and their results are the contents of Chapter Four. Chapter Five concludes the report and indicates some directions for further work.


    CHAPTER TWO

    LITERATURE REVIEW

    2.1 Human Eye and Its Behavior

    A close-up view of a typical open human eye is shown in Fig. 2.1. The most

    significant feature in the eye is the iris. It has a ring structure with a large variety of colours.

    The ring might not be completely visible even if the eye is in its normal state (non-closed or

    partly closed). Visibility depends on the individual variations (Uzunova, 2005). Most often, it

is partly occluded above by the upper eyelid; being completely visible, or occluded by both eyelids, is also possible. The iris also changes its position, from centered to rolled to one side, or rolled upwards or downwards. Depending on its speed, the motion of the iris from side to side is called smooth pursuit or a saccade. A saccade is a rapid iris movement, which happens when fixation jumps from one point to another (Galley et al., 2004).

Inside the iris is the pupil, a smaller dark circle whose size varies depending on the lighting conditions. The sclera is the white visible portion of the eyeball; to the unaided eye it is the brightest part of the eye region, and it directly surrounds the iris. Apart from these features, the eye has two additional salient features, the upper and lower eyelids. Their Latin names are palpebra superior (the upper eyelid) and palpebra inferior (the lower). The gap between them is called the rima palpebrarum.

The eyelids' movements are constrained by their physical attributes. The upper eyelid is a stretchable skin membrane that can cover the eye. It has great freedom of motion, ranging from wide open to closed, with small deformations due to eyeball motion. When the eye is open, the eyelid is a concave arc connecting the two eye corners. As the eye becomes


    Fig. 2.1: The Human eye (Uzunova, 2005).


more and more closed, the curvature of the arc becomes lower; it takes a line-like shape when the eye is nearly closed and follows the lower eyelid when the eye is closed. The lower eyelid, on the other hand, is close to a straight line and moves to a smaller degree. In this thesis the eyelid contours will be referred to simply as eyelids, unless stated otherwise. The eyelids meet each other at the eye corners (angulus oculi). The eye corners will be referred to as inner corners (those closer to the nose) and outer corners, and they are called left or right corners according to how they appear in the image. The skin-colored growth close to the inner corner is a third, degenerated membrane, called the membrana nictitans.

The eye features - the iris and the eyelids - can be involved in very complex movements as part of overall human behaviour, expressing different meanings. Here only the eye features and local movements within the eye region are described. The eye blink is the focus of this thesis. It is a natural act consisting of a closing of the eye followed by an opening, in which the upper eyelid performs most of the movement. Similar to blinking is eyelid fluttering, a quick wavering or flapping motion of the upper eyelid. Here blinking and eyelid fluttering are not distinguished, but blinking and eye closing are not synonyms, especially in the context of a safe-driving system. To distinguish eye closing from blinking, time has to be taken into account.

Blinking can be defined as a temporary hiding of the iris due to the touching of both eyelids within one second, whereas closing takes a longer time. According to researchers (Thorslund, 2003), blinking frequency is affected by different factors such as mood state, task demand, etc. In a stress-free state the blink rate is 15-20 times per minute. It drops to about 3 times per minute during reading, and it increases under stress, time pressure or when close attention is required. The pattern for detecting drowsiness can be described as follows. In


the awake state, the eyelids are far apart before they close, they remain closed for only a short interval, and closing of the eye (a single blink) is repeated rarely. As the person gets tired, the eyelids stay closer to each other, the time during which the eye is closed increases, and the frequency of blinking increases as well; in other words, drowsiness is characterized by long, flat blinks (Galley et al., 2004).

    2.2 Image Representation and Acquisition

Any visual scene can be represented by a continuous function (in two dimensions) of some analogue quantity. This is typically the reflectance function of the scene: the light reflected at each visible point in the scene. Such a representation is referred to as an image, and the value at any point in the image corresponds to the intensity of the reflectance function at that point.

A continuous analogue representation cannot be conveniently interpreted by a computer, and an alternative representation, the digital image, must be used. Digital images also represent the reflectance function of a scene, but they do so in a sampled and quantized form (David, 1991). Fig. 2.2 shows a block diagram depicting the processing steps in computer vision. The basic image acquisition equipment used in this study is the camera.

There are two types of semiconductor photosensitive sensor used in cameras: CCD (charge-coupled device) and CMOS (complementary metal oxide semiconductor). In a CCD sensor, every pixel's charge is transferred through just one output node to be converted to voltage, buffered and sent off-chip as an analogue signal, so all of the pixel area can be devoted to light capture. In a CMOS sensor each pixel has its own charge-to-voltage conversion, and the


Fig. 2.2: Block diagram depicting processing steps in computer vision (Zuechi, 2000). The blocks in the diagram are: image acquisition, enhancement (preprocessing), segmentation, coding (feature extraction), image analysis, and decision making.


sensor often includes amplifiers, noise correction and digitization circuits, so that the chip outputs digital bits (Sonka et al., 2008). These additional functions increase the design complexity and reduce the area available for light capture, but the chip can be built to require less off-chip circuitry for basic operation.

The development of semiconductor technology permits the production of matrix-like sensors based on CMOS technology. This technology is used in mass production in the semiconductor industry; because processors and memories are manufactured using the same technology, the photosensitive matrix-like element can be integrated on the same chip as the processor and/or operational memory. This opens the door to 'smart cameras', in which image capture and basic image processing are performed on the same chip.

The major advantages of CMOS cameras (as opposed to CCD) are a higher range of sensed intensities (about 4 orders of magnitude), a high read-out speed (about 100 ns) and random access to individual pixels. The basic CCD element includes a Schottky photodiode and a field-effect transistor. A photon falling on the junction of the photodiode liberates electrons from the crystal lattice and creates holes, resulting in an electric charge that accumulates in a capacitor. The collected charge is directly proportional to the light intensity and to the duration of its falling on the diode.

The sensor elements are arranged into a matrix-like grid of pixels on a CCD chip. The charges accumulated by the sensor elements are transferred to a horizontal register one row at a time by a vertical shift register, and are then shifted out in bucket-brigade fashion to form the video signal.

There are three inherent problems with CCD chips. First, the blooming effect is the mutual influence of charge in neighboring pixels; current CCD sensor technology is able to suppress this problem (anti-blooming) to a great degree. Second, it is impossible to address individual pixels in a CCD chip directly, because read-out through the shift registers is needed. Third, individual CCD sensor elements are able to accumulate only approximately 30-200 thousand electrons, while the usual level of inherent noise of the CCD sensor is of the order of 20 electrons.

The signal-to-noise ratio (SNR) in the case of a CCD chip is therefore

SNR = 200,000 / 20 = 10^4                                              (2.1)

This implies that the logarithmic noise level is approximately 80 dB at best, so the CCD sensor is able to cope with four orders of magnitude of intensity in the best case. This range drops to approximately two orders of magnitude with common uncooled CCD cameras, while the range of incoming light intensity variations is usually higher.

    2.3 Image pre-processing

Pre-processing is the name used for operations on images at the lowest level of abstraction; both input and output are intensity images (Sonka et al., 2008). These iconic images are usually of the same kind as the original data captured by the sensor, with an intensity image usually represented by a matrix or matrices of image function values (brightness).

    Pre-processing does not increase image information content. If information is

    measured using entropy, then pre-processing typically decreases image information content.

    From the information-theoretic viewpoint it can thus be concluded that the best pre-

    processing is no pre-processing and without question, the best way to avoid (elaborate) pre-

    processing is to concentrate on high- quality image acquisition. Nevertheless, pre-processing

    is very useful in a variety of situations since it helps to suppress information that is not

    relevant to the specific image processing or analysis task. Therefore, the aim of pre-


    processing is an improvement of the image data that suppresses undesired distortions or

    enhances some image features important for further processing.

    A considerable redundancy of information in most images allows image pre-

    processing methods to explore image data itself to learn image characteristics in a statistical

    sense. These characteristics are used either to suppress unintended degradations such as noise

    or to enhance the image. Neighboring pixels corresponding to a given object in real images

    have essentially the same or similar brightness value, so if a distorted pixel can be picked out

    from the image, it can usually be restored as an average of neighboring pixels.

Image pre-processing methods are classified into three categories, namely pixel brightness transformations, geometric transformations, and local pre-processing. These methods are discussed in detail in the following sections.

    2.3.1 Pixel brightness transformations

A brightness transformation modifies pixel brightness; the transformation depends on the properties of the pixel itself. There are two classes of pixel brightness transformations: brightness corrections and gray-scale transformations. Brightness correction modifies the pixel brightness taking into account its original brightness and its position in the image, while gray-scale transformation changes brightness without regard to position in the image.

    2.3.1.1 Position dependent brightness correction

    Ideally, the sensitivity of image acquisition and digitization devices should not

    depend on position in the image, but this assumption is not valid in many practical cases. The

    lens attenuates light more if it passes farther from the optical axis, and the photosensitive part


    of the sensor (vacuum-tube camera, CCD camera elements) is not of identical sensitivity.

    Uneven object illumination is also a source of degradation.

    If degradation is of a systematic nature, it can be suppressed by brightness correction.

A multiplicative error coefficient e(i, j) describes the change from the ideal identity transfer function. Assume that g(i, j) is the original undegraded (desired, or true) image and f(i, j) is the image containing degradation. Then

f(i, j) = e(i, j) g(i, j)                                              (2.2)

The error coefficient e(i, j) can be obtained if a reference image g(i, j) with known brightness is captured, the simplest being an image of constant brightness c. The degraded result is the image f_c(i, j). The systematic brightness errors can then be suppressed as

g(i, j) = f(i, j) / e(i, j) = c f(i, j) / f_c(i, j)                    (2.3)

This method can be used only if the image degradation process is stable. If we wish to suppress this kind of error in the image capturing process, we should perhaps re-calibrate the device (find the error coefficients) from time to time.

The brightness correction method implicitly assumes linearity of the transformation, which is not true in reality because the brightness scale is limited to some interval. The calculation according to equation 2.3 can overflow, and the limits of the brightness scale are then used instead; this implies that the best reference image has a brightness that is far enough from both limits. If the gray-scale has 256 brightness levels, the ideal reference image has a constant brightness value of 128.
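A minimal sketch of the correction of equation (2.3) in Python/NumPy is given below; the function name and the clipping to an 8-bit brightness scale are illustrative assumptions.

```python
# Sketch of the systematic brightness correction of equation (2.3): the degraded
# image g is corrected using a reference image f_c captured from a scene of
# known constant brightness c.
import numpy as np

def correct_brightness(degraded, reference, c=128.0):
    """f(i,j) is the degraded image, f_c(i,j) the reference of constant brightness c.
    Returns c * f(i,j) / f_c(i,j), clipped to the limits of the brightness scale."""
    corrected = c * degraded.astype(np.float32) / np.maximum(reference.astype(np.float32), 1e-6)
    return np.clip(corrected, 0, 255).astype(np.uint8)
```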


2.3.1.2 Gray-scale transformation

Gray-scale transformations do not depend on the position of the pixel in the image. A transformation T of the original brightness p from the scale [p_0, p_k] into a brightness q from a new scale [q_0, q_k] is given by

q = T(p)                                                               (2.4)

The most common gray-scale transformations are shown in Fig. 2.3(a); the piecewise linear function a enhances the image contrast between brightness values p_1 and p_2, the function b is called brightness thresholding and results in a black-and-white image, and the straight line c denotes the negative transformation. Digital images have a very limited number of gray-levels, so gray-scale transformations are easy to realize both in hardware and software. Often only 256 bytes of memory (called a look-up table) are needed.

The original brightness is the index into the look-up table, and the table content gives the

    new brightness. The image signal usually passes through a look-up table in the image

    displays, enabling simple gray-scale transformations in real time. The same principle can be

    used for color displays. A color signal consists of three components: red, green and blue;

    three look-up tables provide all possible color scale transformations.
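The following short Python/NumPy sketch illustrates a gray-scale transformation realised through a 256-entry look-up table, here instantiated with the negative transformation; the names are illustrative.

```python
# A gray-scale transformation realised through a 256-entry look-up table: the
# original brightness indexes the table, and the table content is the new
# brightness. Here the table implements the negative transformation q = 255 - p.
import numpy as np

lut = np.arange(256, dtype=np.uint8)[::-1]   # negative transformation

def apply_grayscale_transform(image, lut):
    """image: uint8 grayscale array; lut: 256-entry table of new brightnesses."""
    return lut[image]
```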

    Gray-scale transformations are used mainly when an image is viewed by a human

    observer, and a transformed image might be more easily interpreted if the contrast is

    enhanced. For instance an X-ray image can often be much clearer after transformation. A

    gray-scale transformation for contrast enhancement is usually found automatically using the

    histogram equalization technique. The aim is to create an image with equally distributed

    brightness levels over the whole brightness scale as in Fig. 2.3(b). Histogram equalization

enhances contrast for brightness values close to histogram maxima, and decreases contrast

    near minima.


(a): The most common gray-scale transformations

(b): Histogram equalization of images

Fig. 2.3: Gray-scale transformations and histogram equalization of images (Sonka et al., 2008)


Denote the input histogram by H(p) and recall that the input gray-scale is [p_0, p_k]. The intention is to find a monotonic pixel brightness transformation q = T(p) such that the desired output histogram G(q) is uniform over the whole output brightness scale [q_0, q_k]. The histogram can be treated as a discrete probability density function. The monotonic property of the transform implies

Σ_{i=0}^{k} G(q_i) = Σ_{i=0}^{k} H(p_i)                                (2.5)

The sums in equation 2.5 can be interpreted as discrete distribution functions. Assume that the image has N rows and M columns; then the equalized histogram corresponds to the uniform probability density function whose function value is a constant:

f = N M / (q_k - q_0)                                                  (2.6)

The value from equation (2.6) replaces the left side of equation (2.5). The equalized histogram can be obtained precisely only for the idealized continuous probability density, in which case equation 2.5 becomes

N M (q - q_0) / (q_k - q_0) = ∫_{p_0}^{p} H(s) ds                      (2.7)

The desired pixel brightness transformation can then be derived as

q = T(p) = q_0 + ( (q_k - q_0) / (N M) ) ∫_{p_0}^{p} H(s) ds           (2.8)

The integral in equation (2.8) is called the cumulative histogram, which is approximated by a sum in digital images, so the resulting histogram is not equalized ideally. The discrete approximation of the continuous pixel brightness transformation from equation 2.8 is

q = T(p) = q_0 + ( (q_k - q_0) / (N M) ) Σ_{i=p_0}^{p} H(i)            (2.9)
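The discrete transformation of equation (2.9) can be implemented directly from the cumulative histogram. A minimal Python/NumPy sketch, assuming an 8-bit grayscale image, is shown below.

```python
# Discrete histogram equalization following equation (2.9): the new brightness is
# read from a look-up table built from the scaled cumulative histogram.
import numpy as np

def equalize_histogram(image):
    """image: uint8 grayscale array (8-bit, gray-levels 0..255)."""
    hist = np.bincount(image.ravel(), minlength=256)   # H(i)
    cum_hist = np.cumsum(hist)                         # cumulative histogram
    n_pixels = image.size                              # N * M
    q0, qk = 0, 255                                    # output brightness scale
    lut = (q0 + (qk - q0) / n_pixels * cum_hist).astype(np.uint8)
    return lut[image]
```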


2.3.2 Geometric transformations

    Geometric transforms are common in computer graphics and are often used in image

    analysis as well. They permit elimination of the geometric distortion that occurs when an

    image is captured. If one attempts to match two different images of the same object, a

    geometric transformation may be needed. We consider geometric transformations only in 2D

    as this is sufficient for most digital images. One example is an attempt to match remotely

    sensed images of the same area taken after a year, when the more recent image was probably

not taken from precisely the same position. To inspect changes over the year, it is necessary first to execute a geometric transformation and then subtract one image from the other. A geometric transform is a vector function T that maps the pixel (x, y) to a new position (x', y'); an illustration of a whole region transformed on a point-to-point basis is shown in Fig. 2.4. T is defined by its two component equations:

x' = T_x(x, y),   y' = T_y(x, y)                                       (2.10)

The transformation equations T_x and T_y are either known in advance (for example, in the case of rotation, translation or scaling) or can be determined from the original and transformed images, where several pixels with known correspondences in both images are used to derive the transformation. A geometric transform consists of two basic steps. The first is the pixel co-ordinate transformation, which maps the co-ordinates of the input image pixel to a point in the output image. The output point co-ordinates should be computed as continuous values (real numbers), as the position does not necessarily match the digital grid after the transform. The second step is to find the point in the digital raster which matches the transformed point and to determine its brightness value.


    Fig. 2.4: Geometric transform on a plane of images (Sonka et al., 2008).


    The brightness is usually computed as an interpolation of the brightness of several

    points in the neighborhood. This idea enables the classification of geometric transforms

    among other preprocessing techniques, the criterion being that only the neighborhood of a

    processed pixel is needed for the calculation. Geometric transforms are on the boundary

    between point and local operations.

    2.3.2.1 Pixel co-ordinate transformations

    Equation (2.10) shows the general case of finding the co-ordinates of a point in the

    output image after a geometric transform. It is usually approximated by a polynomial

    equation.

x' = Σ_{r=0}^{m} Σ_{k=0}^{m-r} a_rk x^r y^k,   y' = Σ_{r=0}^{m} Σ_{k=0}^{m-r} b_rk x^r y^k     (2.11)

This transform is linear with respect to the coefficients a_rk, b_rk, and so if pairs of corresponding points (x, y), (x', y') in both images are known, it is possible to determine a_rk, b_rk by solving a set of linear equations. More points than coefficients are usually used to provide robustness; the mean square method is often used.

    In the case where the geometric transform does not change rapidly depending on

    position in the image, low-order approximating polynomials, m=2 or m=3 are used, needing

    at least 6 or 10 pairs of corresponding points. The corresponding points should be distributed

in the image in a way that can express the geometric transformation; usually they are spread uniformly. In general, the higher the degree of the approximating polynomial, the more sensitive the geometric transform is to the distribution of the pairs of corresponding points.

    Equation (2.10) is in practice approximated by a bilinear transform for which four pairs of

    corresponding points are sufficient to find the transformation coefficients.


x' = a_0 + a_1 x + a_2 y + a_3 x y,   y' = b_0 + b_1 x + b_2 y + b_3 x y           (2.12)

Even simpler is the affine transformation, for which three pairs of corresponding points are sufficient to find the coefficients:

x' = a_0 + a_1 x + a_2 y,   y' = b_0 + b_1 x + b_2 y                               (2.13)

The affine transformation includes typical geometric transformations such as rotation, translation, scaling, and skewing. A geometric transform applied to the whole image may change the co-ordinate system, and a Jacobian J provides information about how the co-ordinate system changes:

J = det( [ ∂x'/∂x  ∂x'/∂y ; ∂y'/∂x  ∂y'/∂y ] )                                     (2.14)

If the transformation is singular (has no inverse), then J = 0. If the area of the image is invariant under the transformation, then J = 1. The Jacobian for the bilinear transform of equation 2.12 is

J = a_1 b_2 - a_2 b_1 + (a_1 b_3 - a_3 b_1) x + (a_3 b_2 - a_2 b_3) y              (2.15)

and for the affine transformation of equation 2.13 it is

J = a_1 b_2 - a_2 b_1                                                              (2.16)

Some important geometric transformations are the following. Rotation by the angle φ about the origin:


x' = x cos φ + y sin φ,   y' = -x sin φ + y cos φ,   J = 1                         (2.17)

Change of scale, a in the x axis and b in the y axis:

x' = a x,   y' = b y,   J = a b                                                    (2.18)

Skewing by the angle φ:

x' = x + y tan φ,   y' = y,   J = 1                                                (2.19)

    It is possible to approximate complex geometric transformations (distortion) by partitioning

    an image into smaller rectangular sub-images; for each sub-image, a simple geometric

transformation, such as the affine, is estimated using pairs of corresponding pixels. The

    geometric transformation (distortion) is then repaired separately in each sub-image.
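As a sketch of how an affine transform (equation 2.13) can be estimated from pairs of corresponding points by the mean-square method, the following Python/NumPy fragment solves for the coefficients by least squares; the function name and point lists are illustrative.

```python
# Estimate the affine coefficients of equation (2.13) from corresponding points
# by least squares; at least three point pairs are needed.
import numpy as np

def fit_affine(src_pts, dst_pts):
    """src_pts, dst_pts: (n, 2) arrays of corresponding (x, y) points, n >= 3.
    Returns (a0, a1, a2) and (b0, b1, b2) of
    x' = a0 + a1*x + a2*y,  y' = b0 + b1*x + b2*y."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    A = np.column_stack([np.ones(len(src)), src[:, 0], src[:, 1]])
    a, _, _, _ = np.linalg.lstsq(A, dst[:, 0], rcond=None)  # mean-square solution
    b, _, _, _ = np.linalg.lstsq(A, dst[:, 1], rcond=None)
    return a, b
```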

    There are some typical geometric distortions which have to be overcome in remote sensing.

Errors may be caused by distortion of the optical systems, by non-linearity in row-by-row scanning and a non-constant sampling period, or by wrong position or orientation of the sensor, leading to skew and line non-linearity distortions. Panoramic distortion (Fig. 2.5b) appears in line scanners with the

    mirror rotating at constant speed. Line non-linearity distortion (Fig. 2.5a) is caused by

    variable distance of the object from the scanner mirror. The rotation of the earth during

    image capture in a mechanical scanner generates skew distortion (Fig. 2.5c). Change of

distance from the sensor induces change-of-scale distortion (Fig. 2.5e). Perspective

    progression causes perspective distortion (Fig. 2.5f).


    Fig. 2.5: Geometric distortion types in images (Sonka et al., 2008).

    (a)Line non-linear distortion (b) panoramic distortion (c) Skew distortion

    (d) Paranormal distortion (e) Change of scale distortion (f) perspective distortion


    2.3.2.2 Brightness interpolation

Brightness interpolation influences image quality. The simpler the interpolation, the greater the loss in geometric and photometric accuracy, but the interpolation neighborhood is often kept reasonably small due to the computational load. The three most common interpolation methods are nearest-neighbor, linear, and bi-cubic.

The brightness interpolation problem is usually expressed in a dual way, by determining the brightness of the original point in the input image that corresponds to a point in the output image lying on the discrete raster. Assume that we wish to compute the brightness value of the pixel (x', y') in the output image, where x' and y' lie on the discrete raster (integer numbers, illustrated by solid lines in Fig. 2.6). The co-ordinates of the point (x, y) in the original image can be obtained by inverting the planar transformation of equation (2.10):

(x, y) = T^{-1}(x', y')                                                            (2.20)

In general, the real co-ordinates after the inverse transformation (dashed lines in Fig. 2.6) do not fit the input image discrete raster (solid lines), and so the brightness is not known. The only information available about the original continuous image f(x, y) is its sampled version g_s(l Δx, k Δy). The interpolated brightness f_n(x, y) can be expressed by the convolution equation

f_n(x, y) = Σ_l Σ_k g_s(l Δx, k Δy) h_n(x - l Δx, y - k Δy)                        (2.21)

The function h_n is called the interpolation kernel. Usually a small neighborhood is used, outside which h_n is zero (Sonka et al., 2008).

Nearest-neighborhood interpolation assigns to the point (x, y) the brightness value of the nearest point g_s in the discrete raster, as shown in Fig. 2.6(a). On the right side is the interpolation kernel in the 1D case. The left side of the figure shows how the new brightness is assigned: dashed lines show how the inverse planar transformation maps the raster


    (a): Nearest neighborhood interpolation

    (b): Linear interpolation

    Fig. 2.6: Interpolation types in images (Sonka et al., 2008).


    of the output image; full lines show the raster of the input image. Nearest-neighborhood

    interpolation is given by

f_1(x, y) = g_s( round(x), round(y) )                                              (2.22)

    The position error of the nearest-neighborhood interpolation is at most half a pixel.

    This error is perceptible on objects with straight-line boundaries that may appear step-like

    after the transformation.

Linear interpolation explores the four points neighboring the point (x, y) and assumes that the brightness function is linear in this neighborhood. Linear interpolation is demonstrated in Fig. 2.6(b) and is given by

f_2(x, y) = (1 - a)(1 - b) g_s(l, k) + a (1 - b) g_s(l + 1, k) + (1 - a) b g_s(l, k + 1) + a b g_s(l + 1, k + 1)     (2.23)

where l = floor(x), a = x - l, k = floor(y), b = y - k. Linear interpolation can cause a small decrease in resolution, and blurring due to its averaging nature, but the problem of step-like boundaries seen with nearest-neighborhood interpolation is reduced.
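A minimal Python/NumPy sketch of linear (bilinear) brightness interpolation at a non-integer point is given below; it assumes the point lies strictly inside the image raster.

```python
# Linear (bilinear) interpolation: the brightness at a non-integer point (x, y)
# is computed from its four raster neighbours, as in equation (2.23).
import numpy as np

def bilinear_interpolate(image, x, y):
    """Assumes (x, y) lies strictly inside the image (no border handling)."""
    l, k = int(np.floor(x)), int(np.floor(y))
    a, b = x - l, y - k                       # fractional offsets within the cell
    g = image.astype(np.float32)
    return ((1 - a) * (1 - b) * g[k, l] +
            a * (1 - b) * g[k, l + 1] +
            (1 - a) * b * g[k + 1, l] +
            a * b * g[k + 1, l + 1])
```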

Bi-cubic interpolation improves the model of the brightness function by approximating it locally by a bi-cubic polynomial surface; 16 neighboring points are used for the interpolation. The one-dimensional interpolation kernel ('Mexican hat') is shown in Fig. 2.7 and is given by

h_3(x) = 1 - 2|x|^2 + |x|^3            for 0 <= |x| < 1
h_3(x) = 4 - 8|x| + 5|x|^2 - |x|^3     for 1 <= |x| < 2
h_3(x) = 0                             otherwise                                   (2.24)


    Fig. 2.7: Bi-cubic interpolation kernel (Sonka et al., 2008).


Bi-cubic interpolation is often used in raster displays that

    enable zooming with respect to an arbitrary point. If the nearest-neighborhood method were

    used, areas of the same brightness would increase. Bi-cubic interpolation preserves fine

    details in the image very well.

    2.3.3 Local pre-processing

    This method uses a small neighborhood of a pixel in an input image to produce a new

    brightness value in the output image. Such preprocessing operations are called filtration (or

    filtering) if signal processing terminology is used. Local pre-processing methods can be

divided into two groups according to the goal of the processing: smoothing and edge detection. Smoothing aims to suppress noise or other small fluctuations in the image; it is equivalent to the suppression of high frequencies in the Fourier transform domain. Unfortunately, smoothing also blurs the sharp edges that bear important information about the image.

    2.3.3.1 Image smoothing

Image smoothing is the set of local pre-processing methods whose predominant use is the suppression of image noise; it exploits the redundancy in the image data. Calculation of the new value is based on averaging the brightness values in some neighborhood. Smoothing poses the problem of blurring sharp edges in the image, and so attention is given to smoothing methods which are edge preserving.


    Local image smoothing can effectively eliminate noise or degradation appearing as thin

    stripes, but does not work if degradations are large blobs or thick stripes (Sonka et al., 2008).

    2.3.3.2 Median filtering

In probability theory, the median divides the higher half of a probability distribution from the lower half. For a random variable x, the median M is the value for which the probability of the outcome x < M is 1/2. For a finite set of values, the median is found by ordering the values and taking the middle one. In median filtering, the brightness of the current pixel is replaced by the median of the brightness values in its local neighborhood, which suppresses impulse noise while preserving edges better than averaging does.


    Fig. 2.8: Horizontal/vertical line preserving neighborhood for median filtering (Sonka et al.,

    2008).
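A minimal sketch of median filtering over a square 3x3 neighborhood, written in Python/NumPy for illustration (border pixels are left unchanged for simplicity):

```python
# Median filter over a 3x3 neighbourhood: each output pixel takes the median of
# its neighbours, suppressing impulse noise while preserving edges better than
# simple averaging. Border pixels are left unchanged for simplicity.
import numpy as np

def median_filter_3x3(image):
    out = image.copy()
    for i in range(1, image.shape[0] - 1):
        for j in range(1, image.shape[1] - 1):
            out[i, j] = int(np.median(image[i - 1:i + 2, j - 1:j + 2]))
    return out
```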


    2.3.3.4 Non-linear mean filter

The non-linear mean filter is another generalization of averaging techniques (Pitas and Venetsanopoulos, 1986); it is defined by

f(m, n) = u^{-1} ( Σ_{(i,j) ∈ O} a(i, j) u( g(i, j) ) / Σ_{(i,j) ∈ O} a(i, j) )    (2.25)

where f(m, n) is the result of the filtering, g(i, j) is the pixel in the input image, and O is a local neighborhood of the current pixel. The function u of one variable has an inverse function u^{-1}; the a(i, j) are weight coefficients. If the weights a(i, j) are constant, the filter is called homomorphic.
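As an illustration of equation (2.25), the following Python/NumPy sketch uses constant weights and u(g) = log(g), giving a homomorphic (geometric-mean-like) filter over a 3x3 neighborhood; the function name is illustrative.

```python
# Non-linear mean filter of equation (2.25) with constant weights and
# u(g) = log(g), i.e. u^{-1}(mean(u(g))) over a 3x3 neighbourhood.
# Interior pixels only, for brevity.
import numpy as np

def nonlinear_mean_filter(image):
    g = image.astype(np.float64) + 1.0        # shift to avoid log(0)
    out = image.astype(np.float64).copy()
    for i in range(1, g.shape[0] - 1):
        for j in range(1, g.shape[1] - 1):
            neigh = g[i - 1:i + 2, j - 1:j + 2]
            out[i, j] = np.exp(np.mean(np.log(neigh))) - 1.0
    return out
```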

2.3.3.5 Edge detectors

    Edge detectors are a collection of very important local image pre-processing methods

    used to locate changes in the intensity function; edges are pixels where this function

    (brightness) changes abruptly. Edges are to a certain degree invariant to changes of

    illumination and viewpoint.

    If only edge elements with strong magnitude (edgels) are considered, such information often

    suffices for image understanding. The positive effect of such a process is that it leads to

    significant reduction of image data. Nevertheless such a data reduction does not undermine

    understanding the content of the image (interpretation) in many cases.

    An edge is a property attached to an individual pixel and is calculated from the image

    function behavior in a neighborhood of that pixel. It is a vector variable with two

components, magnitude and direction. The edge magnitude is the magnitude of the gradient, and the edge direction φ is rotated with respect to the gradient direction ψ by -90°. The gradient direction gives the direction of maximum growth of the image function, e.g. from black f(i, j) = 0 to


white f(i, j) = 255. This is illustrated in Fig. 2.9(a), in which closed lines are lines of equal brightness. The orientation 0° points east.

    Edges are often used in image analysis for finding region boundaries. Provided that

    the region has homogeneous brightness, its boundary is at the pixels where the image

    function varies and so in the ideal case without noise consists of pixels with high edge

    magnitude. It can be seen that the boundary and its parts (edges) are perpendicular to the

    direction of the gradient.

The edge profile in the gradient direction (perpendicular to the edge direction) is typical for edges, and Fig. 2.9(b) shows examples of several standard profiles. Roof edges are typical for objects corresponding to thin lines in the image. Edge detectors are usually tuned for some type of edge profile.

The gradient magnitude |grad g(x, y)| and gradient direction ψ are continuous image functions calculated as

|grad g(x, y)| = sqrt( (∂g/∂x)^2 + (∂g/∂y)^2 )                                     (2.26)

ψ = arg( ∂g/∂x, ∂g/∂y )                                                            (2.27)

where arg(x, y) is the angle (in radians) from the x axis to the point (x, y). Sometimes we are interested only in edge magnitudes without regard to their orientations; a linear differential operator called the Laplacian may then be used. The Laplacian has the same properties in all directions and is therefore invariant to rotation in the image. It is defined as

∇²g(x, y) = ∂²g(x, y)/∂x² + ∂²g(x, y)/∂y²                                          (2.28)


    (a) Gradient direction and edge direction

    (b) Typical edge profile

    Fig. 2.9 Diagrams illustrating edge detection (Sonka et al., 2008).
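A minimal Python/NumPy sketch of equations (2.26) and (2.27), with the derivatives approximated by first differences (n = 1), is given below for illustration.

```python
# Gradient magnitude and direction of equations (2.26)-(2.27), with derivatives
# approximated by first differences (n = 1).
import numpy as np

def gradient_magnitude_direction(image):
    g = image.astype(np.float32)
    dx = np.zeros_like(g)
    dy = np.zeros_like(g)
    dx[:, 1:] = g[:, 1:] - g[:, :-1]          # horizontal first difference
    dy[1:, :] = g[1:, :] - g[:-1, :]          # vertical first difference
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    direction = np.arctan2(dy, dx)            # angle from the x axis, in radians
    return magnitude, direction
```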


Image sharpening has the objective of making edges steeper; the sharpened image is intended to be observed by a human. The sharpened output f is obtained from the input image g as

f(i, j) = g(i, j) - C S(i, j)                                                      (2.29)

where C is a positive coefficient which gives the strength of the sharpening and S(i, j) is a measure of the image function sheerness, calculated using a gradient operator. The Laplacian is often used for this purpose.

    Image sharpening can be interpreted in the frequency domain. The result of the

Fourier transform is a combination of harmonic functions. The derivative of the harmonic function sin(nx) is n cos(nx); thus the higher the frequency,

    the higher the magnitude of its derivative. This explains why gradient operators are used to

    enhance edges.

A technique similar to the image sharpening of equation (2.29), called unsharp masking, is often used in printing industry applications. A signal proportional to an unsharp image

    (heavily blurred by a smoothing operator) is subtracted from the original image. A digital

    image is discrete in nature and so equations (2.26) and (2.27), containing derivatives, must be

    approximated by differences. The first differences of the image g in the vertical direction (for

fixed i) and in the horizontal direction (for fixed j) are given by

Δ_i g(i, j) = g(i, j) - g(i - n, j),   Δ_j g(i, j) = g(i, j) - g(i, j - n)

where n is a small integer, usually 1. The value n should be chosen small enough to provide a good approximation to the derivative, but large enough to neglect unimportant changes in the image function. Symmetric expressions for the differences,

Δ_i g(i, j) = g(i + n, j) - g(i - n, j),   Δ_j g(i, j) = g(i, j + n) - g(i, j - n),


are usually not used because they neglect the impact of the pixel itself.

2.4 Segmentation

    Segmentation refers to the process of partitioning a digital image into multiple

    segments (sets of pixels, also known as super pixels). The goal of segmentation is to simplify

    and/or change the representation of an image into something that is more meaningful and

    easier to analyze. More precisely, image segmentation is the process of assigning a label to

    every pixel in an image such that pixels with the same label share certain visual

    characteristics. Segmentation are divided into three groups which are thresholding, edge-

    based segmentation and region-based segmentation they discussed in detail.

    2.4.1 Thresholding

    Gray-level thresholding is the simplest segmentation process. Many objects or image

    regions are characterized by constant reflectivity or light absorption of their surfaces; a

    brightness constant or threshold can be determined to segment objects and background.

Thresholding is computationally inexpensive and fast; it is the oldest segmentation method and is still widely used in simple applications, and it can easily be done in real time using specialized hardware (Sonka et al., 2008).

A complete segmentation of an image R is a finite set of regions R_1, ..., R_S such that R = R_1 ∪ ... ∪ R_S and R_i ∩ R_j = ∅ for i ≠ j. Complete segmentation can result from thresholding in simple scenes. Thresholding is the transformation of an input image f to an output (segmented) binary image g as follows:


g(i, j) = 1 for f(i, j) ≥ T,   g(i, j) = 0 for f(i, j) < T                         (2.32)

where T is the threshold, g(i, j) = 1 for image elements of objects, and g(i, j) = 0 for image elements of the background (or vice versa). If objects do not touch each other, and if their gray-levels are clearly distinct from the background gray-levels, thresholding is a suitable segmentation method. A global threshold is determined from the whole image f, T = T(f). Local thresholds, on the other hand, are position dependent, T = T(f, f_c), where f_c is that part of the image f in which the threshold is determined. One option is to divide the image f into sub-images f_c and determine a threshold independently in each sub-image; if a threshold cannot be determined in some sub-image, it can be interpolated from the thresholds determined in neighboring sub-images. Each sub-image is then processed with respect to its local threshold.
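A minimal Python/NumPy sketch of basic thresholding (equation 2.32) and of band thresholding is given below; the threshold values are illustrative.

```python
# Basic gray-level thresholding (equation 2.32) and a simple band-thresholding
# variant. Threshold values are illustrative.
import numpy as np

def threshold(image, T=128):
    """g(i,j) = 1 for f(i,j) >= T, 0 otherwise."""
    return (image >= T).astype(np.uint8)

def band_threshold(image, low=80, high=150):
    """Segment pixels whose gray-level lies in the set D = [low, high]."""
    return ((image >= low) & (image <= high)).astype(np.uint8)
```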

Basic thresholding as defined by equation 2.32 has many modifications. One possibility is to segment an image into regions of pixels with gray-levels from a set D and into the background otherwise (band thresholding):

g(i, j) = 1 for f(i, j) ∈ D,   g(i, j) = 0 otherwise                               (2.36)

This thresholding can be useful, for instance, in microscopic blood cell segmentation, where a particular gray-level interval represents the cytoplasm, the background is lighter, and the cell kernel is darker. This thresholding definition can serve as a border detector as well, assuming dark objects on a light background. If the gray-level set D is chosen to contain


just these border gray-levels, and if thresholding according to equation 2.36 is used, the object borders can be detected. There

    are many modifications that use multiple thresholds, after which the resulting image is no

    longer binary, but rather an image consisting of a very limited set of gray-levels.

g(i, j) = 1 for f(i, j) ∈ D_1,   g(i, j) = 2 for f(i, j) ∈ D_2,   ...,   g(i, j) = n for f(i, j) ∈ D_n,   g(i, j) = 0 otherwise     (2.37)

where each D_i is a specified subset of gray-levels.

    Another special choice of gray-level subset Di defines semi-thresholding, which is

    sometimes used to make human-assisted analysis easier:

g(i, j) = f(i, j) for f(i, j) ≥ T,   g(i, j) = 0 for f(i, j) < T                   (2.38)

    This process aims to mask out the image background, leaving gray-level information present

    in the objects. Thresholding has been presented relying only on gray-level image properties.

Note that this is just one of many possibilities; thresholding can also be applied when the pixel values do not represent brightness but instead represent gradient, a local texture property, or the value of any other image decomposition criterion.


    2.4.2 Edge-based segmentation

    Edge-based segmentation represents a large group of methods based on information

    about edges in the images; it is one of the earliest segmentation approaches and still remains

very important. Edge-based segmentations rely on the edges found in an image by edge-detecting operators; these edges mark image locations of discontinuities in gray-level, color, texture, etc.

There are several edge-based segmentation methods, which differ in the strategies leading to final border construction and also in the amount of prior information that can be incorporated into the method. The more prior information that is available to the segmentation process, the better the segmentation results that can be obtained. Prior information affects segmentation algorithms: if a large amount of prior information about the desired result is available, the boundary shape and its relations with other image structures are specified very strictly and the segmentation must satisfy all these specifications. If little information about the boundary is known, the segmentation method must take more local information about the image into consideration and combine it with specific knowledge that is general for an application area.

    If little prior information is available, it cannot be used to evaluate the confidence of

    segmentation results, and therefore no basis for feedback corrections of segmentation is

    available (Sonka et al., 2008).

    The most common problems of edge-based segmentation, caused by noise or

    unsuitable information in an image, are an edge presence in locations where there is no

    border, and no edge presence where a real border exists. Clearly both cases have a negative

    influence on segmentation results.


    2.4.2.1 Edge image thresholding

Almost no zero-valued pixels are present in an edge image; however, small edge values correspond to non-significant gray-level changes resulting from quantization noise, small lighting irregularities, etc. Simple thresholding of an edge image can be applied to remove these values. This approach is based on an image of edge magnitudes processed by an appropriate threshold. Selection of an appropriate global threshold is often difficult and sometimes impossible; p-tile thresholding can be applied to define a threshold, and a more exact approach using orthogonal basis functions has been described, which gives good results if the original data has good contrast and is not noisy.

    2.4.2.2 Edge relaxation

Borders resulting from the previous method are strongly affected by image noise, often with important parts missing. Considering edge properties in the context of their mutual neighbors can increase the quality of the resulting image. All image properties, including those of further edge existence, are iteratively evaluated with more precision until the edge context is totally clear: based on the strength of the edges in a specified local neighborhood, the confidence of each edge is either increased or decreased.

A weak edge positioned between two strong edges is an example of context; it is highly

    probable that this inter-positioned weak edge should be part of a resulting boundary. If, on

    the other hand, an edge (even a strong one) is positioned by itself with no supporting context,

    it is probably not a part of any border.


    2.4.3 Region-based segmentation

    Region growing techniques are generally better in noisy images, where borders are

    extremely difficult to detect. Homogeneity is an important property of regions and is used as

    the main segmentation criterion in region growing, whose basic idea is to divide an image

    into zones of maximum homogeneity. The criteria for homogeneity can be based on gray-

level, color, texture, shape, model (using semantic information), etc. The properties chosen to

    describe regions influence the form, complexity, and amount of prior information in the

    specific region-growing segmentation method.

    Region growing segmentation must satisfy the following condition of complete

segmentation:

R = R_1 ∪ R_2 ∪ ... ∪ R_S,   R_i ∩ R_j = ∅ (i ≠ j),

H(R_i) = TRUE for i = 1, 2, ..., S,

H(R_i ∪ R_j) = FALSE for i ≠ j, R_i adjacent to R_j,

where S is the total number of regions in the image and H(R_i) is a binary homogeneity evaluation of the region R_i. The resulting regions of the segmented image must be both homogeneous and maximal, where by 'maximal' we mean that the homogeneity criterion would no longer hold after merging a region with any adjacent region. The homogeneity criterion may use the average gray-level of the region, its color properties, or an m-dimensional vector of average gray values for multi-spectral images.
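A minimal sketch of seeded region growing in Python/NumPy is given below; the seed point, the 4-connectivity and the gray-level tolerance used as the homogeneity criterion are illustrative assumptions.

```python
# Seeded region growing: starting from a seed pixel, 4-connected neighbours are
# merged while the homogeneity criterion (gray-level close to the running region
# mean) holds. Seed and tolerance are illustrative.
import numpy as np
from collections import deque

def region_grow(image, seed, tolerance=10.0):
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    region[seed] = True
    total, count = float(image[seed]), 1
    while queue:
        i, j = queue.popleft()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and not region[ni, nj]:
                if abs(float(image[ni, nj]) - total / count) <= tolerance:
                    region[ni, nj] = True
                    total += float(image[ni, nj])
                    count += 1
                    queue.append((ni, nj))
    return region
```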

    2.5 Image Analysis/Classification/Interpretation

For some applications the features, as extracted from the image, are all that is required. Most of the time, however, one more step must be taken: classification or interpretation. The most important interpretation method is conversion of units. Rarely will dimensions in pixels or gray-levels be appropriate for an industrial application. As part of the software, a


calibration procedure will define the conversion factors between vision system units and real-world units (Nello, 2000).

Reference points and other important quantities are occasionally not visible on the part, but must be derived from measurable features. For instance, a reference point may be defined by the axes of the tubes on either side of a bend. Error checking, or image verification, is a vital process. By closely examining the features found, or by extracting additional features, the image itself is tested to verify that it is suited to the processing being done. Since features are being checked, this can be considered a classification or interpretation step. Without it, features could have incorrect values because the part is mislocated, upside down or missing, because a light has burned out, because the lens is dirty, etc. A philosophy of fail-safe programming should be adopted; that is, any uncertainty about the validity of the image or the processing should either reject the part or shut down the process. This is

imperative in process control, process verification, and robot guidance, where safety is at

    risk. Unfortunately, error checking procedures are usually specific to a certain type of image;

    general procedures are not available.

    2.6 Decision Making

Decision making, in conjunction with classification and interpretation, is characterized as heuristic, decision-theoretic, syntactic or edge tracking. The most commonly used decision techniques are discussed below.

    2.6.1 Heuristic

In this case, the basis of the machine vision decision emulates how humans might characterize the image, using measures such as the intensity histogram, black-white/white-black transition counts, pixel counts, background/foreground pixel maps, average intensity values, delta or normalized image intensity pixel maps, a number of data points each representing the integration over some area in the picture, and row/column totals.

Systems are often designed to handle decision making within a specific duration of time. For example, some companies implement these programs in hardware and, consequently, can handle decision making at rates as high as 3000 decisions per minute. These systems typically operate with a train-by-showing technique. During training (sometimes called learning), a range of acceptable representative samples is shown to the system, and the representation which is to serve as a standard is established. The representation may be based on a single object, or on the average of the images from many objects, or it may include a family of known good samples, each creating a representation standard to reflect the acceptable variations.

In operating mode, decision-making is based on how closely the representation of the object presently being examined compares to the original or standard representation(s). A goodness-of-fit criterion is established during training to reflect the range of acceptable appearances the system should be tolerant of. If the difference between the representation established from the object under test and the standard exceeds the goodness-of-fit criterion, it is

    considered a reject. Significantly, the decision may be based on a combination of criteria

    (pixel counts and transition count, for example). The goodness-of-fit criteria then become

    based on statistical analysis of the combination of each of the fit criteria.

    Decision-making, in conjunction with these approaches, can be either deterministic or

    probabilistic. Deterministic means that given some state or set of conditions, the outcome of

    a function or process is fully determined with 100% probability of the same outcome.

Probabilistic means that a particular outcome has some probability of occurrence of less than 100%, given some initial set of conditions.
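A minimal Python/NumPy sketch of the train-by-showing, goodness-of-fit decision described above is given below; the feature vectors and the tolerance are illustrative assumptions.

```python
# Train-by-showing / goodness-of-fit decision: a standard feature vector is
# learned from known-good samples, and a part is rejected when its features
# deviate from the standard by more than the tolerance set during training.
import numpy as np

def train_standard(good_samples):
    """good_samples: (n, d) array of feature vectors (e.g. pixel count,
    transition count, average intensity) from acceptable parts."""
    samples = np.asarray(good_samples, dtype=np.float64)
    return samples.mean(axis=0), samples.std(axis=0) + 1e-9

def accept(features, standard_mean, standard_std, tolerance=3.0):
    """Accept if every feature lies within `tolerance` standard deviations."""
    deviation = np.abs(np.asarray(features) - standard_mean) / standard_std
    return bool(np.all(deviation <= tolerance))
```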


    2.6.2 Syntactic Analysis

The ability to make decisions based on pieces of an object is usually based upon syntactic analysis, unlike the decision-theoretic approach. In this case, the object is represented as a string, a tree, or a graph of pattern primitive relationships, and decision making is based on a parsing procedure. Another way to view this is as local feature analysis (LFA) - a collection of local features with specified spatial relationships between their various combinations. Again, these primitives can be derived from binary or grayscale images, either thresholded or edge processed.

For example, three types of primitive - curve, angle and line - can together be used to describe a region. Image analysis involves decomposing the object into its primitives, and the relationships between the primitives result in recognition. The decision making on primitives can be performed using decision-theoretic or statistical techniques.

    2.6.3 Edge tracking

    In addition to geometric feature extraction of boundary images, image analysis can be

conducted by edge tracking: when an edge is detected, it is stored as a linked chain of edge points. Alternatively, line encoding and connectivity analysis can be conducted; that is, the locations of the detected edge points are stored and line fitting is performed (Zuechi, 2000).

Decision making is then based on comparison of the line segments directly, or on probability theory. Line segment descriptions of objects are called structural descriptions, and the process of comparing them to models is called structural pattern recognition.


    2.7 Related Works

A thorough survey of work related to eye-tracking techniques and eye-blink detection systems is presented below.

    2.7.1 Eye tracking technique

Tian et al. (2000a) present a dual-state eye model with two templates - one for the closed and one for the open eye. The template for the open eye consists of a circle and two parabolic

    arcs. The circle, described by three parameters x0, y0, r ((x0, y0) the centre and r the

    radius), represents the iris. The arcs represent the eyelids. They are described by 3 points

    one for each eye corner and one on the apex of the eyelids. The template for a closed eye is a

    straight line between the eye corners. If the iris is detected, the eye is open and modeled with

    a template for an open eye, otherwise it is closed. They assume that the eye features are

    given on the first frame. The inner corners are tracked by minimizing the squared difference

    between the intensity values regions close to the corners in two subsequent frames. The outer

    corners are detected as first, lying on the line between the two inner corners and second stay

    apart from them in width w (certain value obtained on the first frame). After botheye corners

    are fixed, to complete the eye tracking, the eyelids have to be localized. It is done by tracking

    central points on both eyelids by minimizing the squared difference between the intensity

    values. They tested their method over 500 image sequences where the full-size face takes

    220x300 pixels and each eye region - 60x30. The method works robustly and accurately

    across race and expressions variety and make-up presence.

    Tian et al. (2000b) have developed a system for recognizing three action units - completely closed, narrow-open and completely open eye - by use of Gabor wavelets in nearly frontal image sequences. The feature points are three: the inner corner, the outer corner and the middle

  • 8/6/2019 Main body 1 2

    44/93

    44

    point between the first two. The eye corners are tracked through the whole sequence. The most important of them for tracking is the inner corner; the positions of the others are found relative to the inner corner position. The initial positions of the feature points are given. Each point is then tracked by minimizing a function of the intensity (grayscale) values over a certain displacement. The outer corners are detected by using the size of the eyes, obtained on the first frame. The middle point is the point midway between the inner and outer corner. For each of these three feature points, a set of multi-scale and multi-orientation Gabor coefficients is calculated: three spatial frequencies and six orientations, starting from 0 and differing by π/6, are used. These 18 coefficients are fed into a neural network to determine the state of the eye. Unfortunately, only the success of detecting the eye state (not of the eye-corner tracking) is reported, since this is the main aim of the paper. The recognition rate when three action units are recognized is 89%, and when only two are distinguished (treating the narrow eye as closed) it increases to 93%.

    Sirohey et al. (2002) present a flow-based method for tracking. Their method for detection is based on finding the combination of edge segments that best represents the upper eyelid. First, the head motion is detected from the edge segments associated with the silhouette of the head, and on the basis of this information the head is stabilized. The head motion vectors are subtracted from the iris and eyelid motion so that only their independent motions remain. The eyelids are tracked as follows: the edge pixels of the eyelid that have flow vectors associated with them are followed according to the direction and magnitude of the flow vector. If edge pixels are found in the close neighborhood of the pointed-to pixels, they are labeled as possibly belonging to the eyelid. The candidates are fitted to a third-order polynomial.

  • 8/6/2019 Main body 1 2

    45/93

    45

    With this method the iris is found correctly at each frame and the eyelids in 90% of the frames (two sequences of 120 frames of a single person, with and without glasses). The paper does not mention how the lower eyelid is modeled, extracted and tracked. Blinking is detected from the height of the apex of the upper eyelid above the iris centre.

    Black et al. (1996) explore a template-based approach combined with parametric optical flow (for example affine), in which they represent rigid and deformable facial motions using piecewise parametric models of image motion. The facial features - face, eye regions, eyebrows and mouth - are given. The face is broken down into parts and the motion of each of them is modeled independently by planar motion models. The affine model is sufficient to model the eye motion:

    u(x, y) = a0 + a1·x + a2·y,    v(x, y) = a3 + a4·x + a5·y,

    where u and v are the horizontal and vertical components of the flow at image point p(x, y). The coordinates are defined with respect to some image point (typically the centre of the region).

    The difference between the current image and the previous image warped by the motion parameters is minimized by a simple gradient-descent scheme. The eye state transition can be described by three parameters - vertical translation, divergence (isotropic expansion) and rapid deformation (squashing and stretching) - whose interpretation is given in Table 2.1.

    The curves of all three are plotted against time and observed for local maxima and minima. The changes in the three functions have to appear at nearly the same time. An eye blink is detected when the translation reaches a maximum, the divergence a minimum and the deformation a maximum. The reported accuracy of 88% for artificial sequences and 73% for TV movies is measured over all facial expressions. Unfortunately, the achieved processing time is 2 min/frame, which is not suitable for real-time applications.

  • 8/6/2019 Main body 1 2

    46/93

    46

    Table 2.1: Parameters describing the movement in the eye region (source: Black et al., 1996)

  • 8/6/2019 Main body 1 2

    47/93

    47

    Cohn et al. (2004) and Moriayama et al. (2004) present different aspects of the same system, in which a carefully detailed generative eye model (see Fig. 2.10) is used. A template is built using two types of parameters: structure and motion. The structure parameters describe the appearance of the eye region, capturing its racial, individual and age variations. This includes the size and color of the iris, the sclera, the dark regions near the left and right corners, the eyelids, the width and boldness of the double-fold eyelid, the width of the bulge below the eye, and the width of the illumination reflection on the bulge and the furrow below the bulge. The motion parameters describe the changes over time. The movement of the iris is described by the 2D position of its centre, and closing and opening of the eye is represented by the height of the eyelids. The skew of the upper eyelid is also a motion parameter, capturing the change of the upper eyelid when the eyeball moves. Unfortunately, the structure parameters are not obtained automatically: the model is individualized by manually adjusting the structural parameters. From this initialization they derive the structural parameters, which remain fixed for the entire sequence. The features are then tracked by iterative minimization of the mean square error between the input image and the template generated by the current motion parameters.

    According to Cohn et al. (2000), the model for tracking the eye features and blink detection is part of a system for automatic recognition of embarrassed smiles. They tested the hypothesis that there is a correlation between head movement, eye gaze and lip displacement during embarrassed smiles; this is probably why the accuracy of the tracking method itself is not measured or reported. The second paper reports failure in only 2 of 576 image sequences, caused by the head tracker. The database includes a variety of subjects of different ethnic groups, ages and genders, with in-plane and limited out-of-plane motion.

  • 8/6/2019 Main body 1 2

    48/93

    48

    Fig. 2.10: Detailed eye template used (Moriayama et al., 2004; Cohn et al., 2004)

  • 8/6/2019 Main body 1 2

    49/93

    49

    An active contour technique is applied by Paradas (2000) to track the eyelids. The model for the eye consists of two curves: one for the lower eyelid, with one minimum, and one for the upper eyelid, with one maximum. Tracking of the eyelids is done with an active contour technique in which the motion is embedded in the energy-minimization process of the snakes.

    A closed snake, which tracks the eyelids, is built by selecting a small percentage of the pixels along the contours obtained during initialization or tracked on the previous frame; among these points are the eye corners. Motion-compensation errors are computed for each snaxel (x0, y0) within a given range of allowed displacements (dx, dy). The pixels (x0+dx, y0+dy) that produce the smallest compensation error are selected as candidates for the snaxel (x0, y0) in the current frame, and a two-step dynamic-programming algorithm is run over these candidates.

    The paper does not report the running time of the algorithm. The author only mentions that it is stable against blinking, head translation and rotation, up to the extent where the eyes remain visible.

    2.7.2 Blink detection systems

    Very briefly, the ways in which the authors of the reviewed papers detect blinking are summarized below.

    In Tian et al. (2000), blinking is detected when the iris is not visible. This is not the most appropriate criterion: even if the iris-detection method itself never fails, false alarms and misclassifications might occur because the eye or iris is occluded by head rotation.

    Sirohey et al. (2002) detect a blink from the height of the apex of the upper eyelid above the iris centre, which is probably a consequence of not tracking the lower eyelid.

  • 8/6/2019 Main body 1 2

    50/93

    50

    The extension of Tian's approach (2000) is a paper by Cohn et al. (2002). It focuses on blink detection rather than on locating the eye features. The eye region is defined on the first frame by manually picking four points: the two eye corners, the centre point of the upper eyelid and a point straight under it. It stays the same over the whole image sequence, because the face region is stabilized. The eye region is divided into two portions, upper and lower, by the line connecting the eye corners. Blink detection relies on the fact that the intensity distributions of the upper and lower parts change as the eye opens and closes. The upper part consists of sclera, pupil, eyelash, iris and skin, of which only the sclera and skin contribute to increasing the average intensity values. When the upper eyelid closes, the eyelash moves into the lower region and the pupil and iris are replaced by brighter skin, which increases the average intensity of the upper portion and simultaneously decreases that of the lower. The average greyscale intensities of both portions are plotted against time; the eye is closed when the upper curve reaches a maximum. A blink is also distinguished from eyelid flutter by counting the number of crossings and peaks: while a blink is under way there is only one peak between two neighbouring crossings, otherwise there is more than one.

    Correlation with a template of the person's eye is used in the paper by Grauman (2001) to classify the state of the eye. The difference image during the first several blinks is used to detect the eye regions. Candidates are discarded based on anthropometric measures: the distances between the blobs and their widths and heights should keep certain ratios, among other constraints. The remaining candidate pairs are classified by the Mahalanobis distance between their parameter vector and the mean blink-pair property vector. The bounding box of the detected eye region determines the template. Further, eye blinking is decided by

  • 8/6/2019 Main body 1 2

    51/93

    51

    calculating the correlation between this template and the image in the current frame. As the eye closes, it begins to look less like the template eye; as it reopens, it becomes more and more similar. A correlation score between 0.85 and 1 classifies the eye as open, a score between 0.55 and 0.8 as closed, and below 0.4 the tracker is considered lost. Again, the technique is appropriate only for blink detection, not for precise eye-feature extraction and tracking. The reported overall detection accuracy is 95.6% at an average of 28 frames per second. This result might change for longer image sequences, such as those used for driver-drowsiness detection, because the template of a single person expires over time as the person gets tired.

    Ramadan et al. (2002) used an active deformable model technique to track the iris. A statistical pressure snake, in which the internal forces are eliminated, tracks the pupil: the snake expands and encloses the pupil. When the upper eyelid occludes the pupil, i.e. during a blink, the snake collapses, and the duration of the snake collapse is the measure of the blink. After the eye reopens, the snake can expand again if the iris position has not changed during the blink; otherwise its position has to be re-initialized manually. Although very high tracking accuracy is reported, the system suffers from several disadvantages. The main problem is the manual initialization and re-initialization. Furthermore, the way of measuring blinking does not seem very reliable: the snake might collapse during saccades, which would be mistaken for blinks. The third problem is the position of the camera, which is attached to the head; this restricts head movements and also makes the equipment inapplicable for drivers.

    Danisman et al. (2010) presented an automatic drowsy-driver monitoring and accident prevention system based on monitoring changes in the eye-blink duration. The proposed method detects visual changes in eye locations using the proposed horizontal

  • 8/6/2019 Main body 1 2

    52/93

    52

    symmetry feature of the eyes. This method detects eye blinks via a standard webcam in real time at 110 fps for a 320x240 resolution. Experimental results on the ZJU eye-blink database showed that the proposed system detects eye blinks with 94% accuracy and a 1% false-positive rate.

  • 8/6/2019 Main body 1 2

    53/93

    53

    CHAPTER THREE

    ALGORITHM DEVELOPMENT

    3.1 System Flowchart

    The flowchart in Fig. 3.1 describes the processes involved in the drowsiness detection performed by the system. The image is acquired with the aid of a digital camera and converted to grayscale. The search for the location of the eye is initialized by analyzing the involuntary blinks of the user of the system; this is achieved by a motion-analysis technique. An online template of the eye is created and used to update the position of the eye every thirty seconds, to compensate for slight movements of the driver's head. Whenever the output of the tracker is lost, the system re-initializes itself by automatically repeating the process. Once tracking is successful, the system proceeds to extract the visual cue of the driver's eye by detecting the blinks produced. This information is used to decide when to trigger the alarm. In other words, drowsiness detection keeps track of the number of blinks produced by the user; when this reaches a critical point, which translates into detecting a short period of micro-sleep, the alarm is triggered. If the alarm is not triggered within five minutes, the system resets itself automatically.

    3.2 Software Development

    The algorithm was developed in the C language in the Visual Studio environment, linked with the OpenCV library, which is widely used for image processing and computer vision. The algorithm is broken down into five processes, namely: eye detection, template creation, eye tracking, blink detection and drowsiness detection.

  • 8/6/2019 Main body 1 2

    54/93

    54

    Fig. 3.1: Flowchart of the eye-blink detection system (stages: image acquisition, eye detection, eye tracking, blink detection, drowsiness detection and alarm activation, with yes/no branches for tracking success and detected drowsiness)

  • 8/6/2019 Main body 1 2

    55/93

    55

    A number of significant contributions and advancements have been made to the work of Grauman et al. (2001) in order to improve the accuracy and reliability of the system; these are discussed in detail below.

    3.3 Eye-Detection

    In this stage the system tries to locate the position of the eye by analyzing the blinking of the user. This is achieved by creating a difference image from the current frame and the previous gray-scaled frame of the driver; the difference image then undergoes binarization. Binarization is the conversion of a gray-scaled image to a binary image, which is often used to show regions of significant movement in the scene. A binary image is an image in which each pixel assumes one of two discrete values, in this case 0 and 1, with 0 representing black and 1 representing white after thresholding.
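    A minimal sketch of this frame-differencing and thresholding step is given below, assuming the legacy OpenCV C interface used elsewhere in this work; the function name make_motion_mask and the threshold value of 15 are illustrative choices, not the exact values of the implemented system.

    /* Sketch: absolute frame difference followed by binarization (OpenCV 1.x C API). */
    #include <opencv/cv.h>

    void make_motion_mask(IplImage *prev_gray, IplImage *curr_gray, IplImage *diff_bin)
    {
        /* Absolute difference between consecutive gray-scale frames:
           large values appear where something (e.g. an eyelid) moved. */
        cvAbsDiff(curr_gray, prev_gray, diff_bin);

        /* Binarization: pixels whose change exceeds the (assumed) threshold of 15
           become 255 (logical 1, white); all other pixels become 0 (black). */
        cvThreshold(diff_bin, diff_bin, 15, 255, CV_THRESH_BINARY);
    }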

    The next phase in this stage is to eliminate noise, which is often caused by naturally occurring jitter due to lighting conditions and camera resolution. We employ functions in the OpenCV library that provide a fast, convenient interface for morphological transformations on images, namely dilation and erosion. They remove noise and produce fewer, larger connected components. A 3x3 star-shaped convolution kernel is passed over the binary image in an opening morphological operation. Listing 1 in Appendix A shows the algorithm for the opening morphological operation.
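    The sketch below illustrates the idea of the opening operation with a 3x3 cross-shaped kernel in the OpenCV C interface; Listing 1 in Appendix A remains the authoritative version, and the helper name open_binary_image is hypothetical.

    /* Sketch: opening (erosion followed by dilation) with a 3x3 cross-shaped kernel. */
    #include <opencv/cv.h>

    void open_binary_image(IplImage *diff_bin)
    {
        /* 3x3 star/cross-shaped structuring element, anchored at its centre. */
        IplConvKernel *kernel =
            cvCreateStructuringElementEx(3, 3, 1, 1, CV_SHAPE_CROSS, NULL);

        /* Erosion removes isolated noise pixels; dilation restores the size of the
           surviving blobs, leaving fewer and larger connected components. */
        cvErode(diff_bin, diff_bin, kernel, 1);
        cvDilate(diff_bin, diff_bin, kernel, 1);

        cvReleaseStructuringElement(&kernel);
    }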

    Candidate eye-blobs are extracted by recursive labeling of the connected components of the resulting binary image. The system then determines whether a pair of connected components is a possible match for the user's eyes. The algorithm for connected-component labeling is shown in Listing 2 in Appendix A.

  • 8/6/2019 Main body 1 2

    56/93

    56

    A number of experimentally derived heuristics are applied, based on the width, height, vertical distance and horizontal distance of the components, to pinpoint the pair that most likely represents the driver's eyes (Chau et al., 2005). The system proceeds only if the number of connected components is two; otherwise the process re-initializes itself. The heuristics are expressed as a set of rules applied by a series of filters: the widths of the components must be about the same, the heights of the connected components must be about the same, and the vertical distance between them must be small (a sketch of such a filter is given at the end of this section). If a pair of components passes through this set of filters, there is a good indication that the driver's eyes have been successfully located. This technique is known as motion analysis.

    Connected component labeling is applied next to obtain the number of connected

    components in the difference image. Fig. 3.2 shows the thresholded difference image prior to

    erosion.
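    The following sketch shows how the filtering rules above might be expressed in C; the Blob structure and all tolerance values (5 pixels, and the 2x-6x width range for the horizontal distance) are hypothetical examples rather than the exact thresholds of Chau et al. (2005) or of this system.

    /* Sketch: heuristic filter deciding whether two blobs form a plausible eye pair. */
    #include <stdlib.h>

    typedef struct { int x, y, width, height; } Blob;  /* bounding box of a component */

    int is_probable_eye_pair(Blob a, Blob b)
    {
        /* Widths and heights of the two components must be about the same. */
        if (abs(a.width  - b.width)  > 5) return 0;
        if (abs(a.height - b.height) > 5) return 0;

        /* Vertical distance between the components must be small... */
        if (abs(a.y - b.y) > 5) return 0;

        /* ...while the horizontal distance must be plausible for a pair of eyes. */
        int dx = abs(a.x - b.x);
        if (dx < 2 * a.width || dx > 6 * a.width) return 0;

        return 1;  /* the pair most likely represents the driver's eyes */
    }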

    3.4 Template Creation

    After the connected components have successfully passed through the filter, the larger of the two components is chosen for template creation, because the size of the template to be created is directly proportional to the chosen component. The larger the chosen component, the more brightness information it contains, which results in more accurate tracking. The system therefore obtains the boundary of the selected component, which is used to extract a portion of the current frame as the eye template. Since we need an open-eye template, it would be a mistake to create the template the moment the eye is located, because blinking involves closing and opening of the eye.

  • 8/6/2019 Main body 1 2

    57/93

    57

    Fig. 3.2: Transition during the eye detection using the motion analysis technique

  • 8/6/2019 Main body 1 2

    58/93

    58

    Once the eye is located, we therefore set some delay before creating the template: since the user's eyes are still closed at the heuristic filtering stage above, the system waits a moment for the user to open his eyes. Listing 4 in Appendix B shows the algorithm used to create the online template of the eye.
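    A minimal sketch of the template-extraction step, assuming the OpenCV C interface, is given below; the function name create_eye_template is hypothetical, and eye_box stands for the bounding box of the larger connected component.

    /* Sketch: copy the open-eye region of the current frame into a new template image. */
    #include <opencv/cv.h>

    IplImage *create_eye_template(IplImage *gray_frame, CvRect eye_box)
    {
        IplImage *templ = cvCreateImage(cvSize(eye_box.width, eye_box.height),
                                        gray_frame->depth, gray_frame->nChannels);

        /* Copy only the region of the current frame that contains the open eye. */
        cvSetImageROI(gray_frame, eye_box);
        cvCopy(gray_frame, templ, NULL);
        cvResetImageROI(gray_frame);

        return templ;  /* used later by the template-matching tracker */
    }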

    3.5 Eye-Tracking

    Eye detection alone is not sufficient to give the highly accurate blink information desired, since the head may move from time to time. A fast tracking procedure is needed to maintain exact knowledge of the eye's appearance. Having the eye template and the live video feed from the camera makes it possible for the system to locate the user's eye in the subsequent frames using template matching. The search is limited to a small search window, since searching the whole image would use an extensive amount of CPU resources.

    The system utilizes the square difference matching method, which matches the

    squared difference so that a perfect match will be zero and bad matches will be large (Gray,

    2008). The equation is given by:

    R(x, y) = Σ over (x', y') of [ T(x', y') − I(x + x', y + y') ]²                (3.1)

    where T(x', y') and I(x + x', y + y') are the brightness of the pixels at (x', y') in the template and at (x + x', y + y') in the source image respectively; in the normalised form of the measure, the score is scaled using the average value of the pixels in the template raster and the average value of the pixels in the current search window of the image. Any time the squared difference exceeds a predefined threshold, the tracker is believed to be lost; in this event it is critical that the tracker declares itself lost and re-initializes by going back to eye detection by the

  • 8/6/2019 Main body 1 2

    59/93

    59

    motion analysis technique. Fig. 3.3 shows sample frames of the tracked object. Listing 5 in Appendix A shows the algorithm for locating the eye in subsequent frames; the location of the best match is available in minloc, which is used to draw a rectangle in the displayed frame to label the object being tracked, as shown in Fig. 3.3.
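    The sketch below shows one tracking step with normalized squared-difference matching in the OpenCV C interface; the lost-tracker threshold of 0.2 and the function name track_eye are assumptions for illustration, not the exact values of Listing 5.

    /* Sketch: one template-matching step over the search window (OpenCV 1.x C API). */
    #include <opencv/cv.h>

    /* Returns 1 on success and writes the best-match location into *minloc;
       returns 0 if the tracker should declare itself lost. */
    int track_eye(IplImage *search_win, IplImage *templ, CvPoint *minloc)
    {
        CvSize res_size = cvSize(search_win->width  - templ->width  + 1,
                                 search_win->height - templ->height + 1);
        IplImage *result = cvCreateImage(res_size, IPL_DEPTH_32F, 1);

        /* Equation (3.1): a perfect match gives 0, bad matches give large values. */
        cvMatchTemplate(search_win, templ, result, CV_TM_SQDIFF_NORMED);

        double min_val, max_val;
        CvPoint max_loc;
        cvMinMaxLoc(result, &min_val, &max_val, minloc, &max_loc, NULL);
        cvReleaseImage(&result);

        /* If even the best match is poor, go back to eye detection by motion analysis. */
        return (min_val < 0.2) ? 1 : 0;
    }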

    3.6 Blink Detection

    A human being must periodically blink to keep his eyes moist. Blinking is

    involuntary and fast. Most people do not notice when they blink. However, detecting a

    blinking pattern in an image sequence is an easy and reliable means to detect the presence of

    a face. Blinking provides a space-time signal which is easily detected and unique to faces.


    In the algorithms developed in previous works (Grauman et al., 2001; Chau et al., 2005), eye blinks were detected by observing correlation scores, so that the detection of blinking and the analysis of blink duration are based solely on the correlation scores generated in the tracking step using the online template of the user's eye. In other words, as the user's eye closes during a blink, its similarity to the open-eye template decreases. While the user's eye is in the normal open state, very high correlation scores of about 0.85 to 1.0 are reported; as the user blinks, the scores fall to values of about 0.5 to 0.55.

    In theory, when the user blinks, the similarity to the open-eye template decreases. While this is true in most cases, we found that it is only reliable if the user does not make any significant head movements: if the user moves his head, the correlation score also decreases even if the user does not blink.

  • 8/6/2019 Main body 1 2

    60/93

    60

    Fig.3.3: Sample frames of the tracked object

  • 8/6/2019 Main body 1 2

    61/93

    61

    In this system we therefore use motion analysis to detect eye blinks, just as in the very first stage above; only this time the detection is limited to a small search window, the same window that is used to locate the user's eye. Listing 6 in Appendix A shows the algorithm for blink detection using motion analysis.

    From Listing 6 in Appendix A, cvFindContours returns the connected components in comp and the number of connected components nc. To determine whether a motion is an eye blink or not, we apply two rules to the connected components: there must be only one (1) connected component, and that component must be located at the centroid of the user's eye (a sketch of this test is given after the next paragraph).

    Note that we require only one (1) connected component, although a normal blink would yield two (2) connected components; this is because we perform the motion analysis in a small search window that fits only one (1) eye.
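    A hedged sketch of this blink test follows, assuming the OpenCV C interface and a binarized difference image of the search window; the function name is_blink and the 10-pixel centroid tolerance are illustrative assumptions, not the exact values of Listing 6.

    /* Sketch: blink test by motion analysis inside the eye search window. */
    #include <opencv/cv.h>
    #include <stdlib.h>

    int is_blink(IplImage *diff_win_bin, CvPoint eye_centroid)
    {
        CvMemStorage *storage = cvCreateMemStorage(0);
        CvSeq *comp = NULL;

        /* Number of connected components in the binarized difference of the window. */
        int nc = cvFindContours(diff_win_bin, storage, &comp, sizeof(CvContour),
                                CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

        int blink = 0;
        if (nc == 1 && comp != NULL) {
            /* The single component must sit at the centroid of the user's eye. */
            CvRect r = cvBoundingRect(comp, 0);
            int cx = r.x + r.width / 2, cy = r.y + r.height / 2;
            if (abs(cx - eye_centroid.x) < 10 && abs(cy - eye_centroid.y) < 10)
                blink = 1;
        }

        cvReleaseMemStorage(&storage);
        return blink;
    }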

    3.7 Drowsiness Detection

    Driver drowsiness is one specific human error that has been well studied. Studies have shown that immediately prior to accidents, the driver's eye-blinking behavior changes (Thorslund, 2003). The basic parameter used to detect drowsiness is the frequency of blinks; the system detects micro-sleep symptoms in order to diagnose driver fatigue. As driver fatigue increases, the blinks of the driver tend to last longer and drowsiness gradually sets in.

    Mathematically, this relationship can be expressed as f ∝ 1/T, where f is the frequency of blinks and T is the blink duration.

    In other words, since the frequency of blinks is inversely proportional to the blink duration, a highly alert person produces blinks of relatively short duration, whereas as drowsiness sets in the blink duration increases; when the blink duration is high the frequency of blinks is low, and vice versa.

    The system determines the blink rate by counting the number of consecutive frames in which the eye remains closed. The system is designed to trigger a warning signal via an alarm once the early stages of drowsiness are detected.
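    A minimal sketch of this decision logic is shown below; the threshold of 60 consecutive closed-eye frames (about two seconds at 30 fps) is an assumed example value, not the threshold used by the system.

    /* Sketch: micro-sleep decision by counting consecutive closed-eye frames. */
    #define MICROSLEEP_FRAMES 60   /* assumed example threshold (~2 s at 30 fps) */

    static int closed_frames = 0;

    /* Call once per frame with eye_closed = 1 when a closure/blink is detected. */
    int update_drowsiness(int eye_closed)
    {
        if (eye_closed)
            closed_frames++;
        else
            closed_frames = 0;     /* eye reopened: reset the counter */

        return closed_frames >= MICROSLEEP_FRAMES;   /* non-zero => trigger the alarm */
    }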

    3.8 Hardware Consideration

    The system was primarily developed and tested on a Windows Vista PC with an AMD Turion 3 GHz processor and 2 GB of RAM. Video was captured with an Averon webcam based on a colour CMOS image sensor, which captures at 30 frames per second; the system processes the video as grayscale images at 320x240 pixels. The sensor has a resolution of 1.3 megapixels and a signal-to-noise ratio of 48 dB, which enhances the accuracy of the system.

  • 8/6/2019 Main body 1 2

    63/93

    63

    CHAPTER FOUR

    RESULTS AND DISCUSSION

    In order to ascertain the reliability of the system, a performance evaluation was carried out. In addition, a compatibility test showed that the system runs on operating systems such as Windows XP and Windows Vista and performs satisfactorily.

    4.1 Blink Detection Accuracy

    The blink-detection accuracy test was conducted using 10 different test subjects, since a more meaningful measure of the overall accuracy of the system is obtained across a broad range of users. In order to measure the detection accuracy, video sequences were captured of each test subject sitting 60 cm away from the camera. The subjects were asked to blink naturally but frequently and to exhibit mild head movements.

    A total of 500 true blinks from the 10 test subjects were analyzed, with each test subject producing 50 blinks. During this evaluation the system encountered two types of errors: missed-blink errors and false-positive blink errors. A missed blink occurs when the system fails to detect the subject's blink when there actually was one. A false-positive blink occurs when the system detects a blink when none was produced by the test subject.

    Twelve (12) blinks were missed out of 500, giving an initial accuracy of (500 - 12)/500 = 97.6%. Furthermore, 15 false-positive blinks were encountered, a false-positive rate of 15/500 = 3%, reducing the overall accuracy of the system to (500 - 12 - 15)/500 = 94.6%.

  • 8/6/2019 Main body 1 2

    64/93

    64

    Table 4.1 shows the summary of results. From the foregoing, the capture rate of the camera, which is 30 frames/second, was used to produce a blink-detection accuracy of 94.6% with a 3% false-positive error. This result is comparable to the work of Danisman et al. (2010), which employed a camera with a capture rate of 110 frames/second in order to obtain an accuracy of 94.8% and a 1% false-positive error.

  • 8/6/2019 Main body 1 2

    65/93

    65

    Table 4.1: Summary of results

    Total number of blinks analyzed 500

    Total missed blinks 12

    Total false positive blinks 15

    Percentage initial accuracy of the system 97.6%

    Percentage overall accuracy of the system 94.6%

  • 8/6/2019 Main body 1 2

    66/93

    66

    4.2 Eye Tracking Accuracy

    This experiment was conducted by placing the test subject at varying distances from the camera. A time constraint of 30 seconds was placed on the system for the automatic initialization of the tracker, which consists of two small bounding boxes that appear on the image. If the tracker does not appear within 30 seconds, it is considered lost. The test was conducted at distances of 30 cm, 60 cm, 90 cm, 150 cm and 180 cm. A log was kept of the number of times the tracker appeared within thirty seconds, and this was expressed as a percentage of the total number of trials at each distance:

    Tracking accuracy (%) = (number of successful tracker initializations within 30 s / total number of trials) x 100

    Fig. 4.1 shows a plot of the percentage tracking accuracy against distance. From the plot, at a distance of 30 cm the accuracy is 72%, while at a distance of 150 cm it drops to 10%.

    The tracking accuracy of the system enables us to ascertain the sensitivity of the

    system at varying d