
MODEL-BASED HUMAN EAR

LOCALIZATION AND FEATURE EXTRACTION

Ernő Jeges, Budapest University of Technology and Economics, Hungary, [email protected]

László Máté, Search-Lab Ltd, Hungary, [email protected]

ABSTRACT

Nowadays the viability of ear-based biometric identification and the uniqueness of ears are

beyond question, but reliable technical solutions as the basis for a successful commercial

application have yet to appear. As opposed to face recognition, in which a model-based

approach is widely used, surprisingly little effort has been put into using ear models in

automatic identification, even though ear shape is more robust than facial characteristics,

being unaffected by emotional expressions or other changes like facial hair or glasses. In

this paper we would like to introduce our latest work in a model-based approach for ear

identification, featuring schemes for both ear localization and feature extraction. The first

tests of our proposed solution have proved that the method is accurate enough to allow its

application in intelligent video surveillance, where people can be remotely identified and

their identity can be continuously tracked using the ears visible on video surveillance

camera images.

KEYWORDS: ear biometrics, image processing, model-based identification, active

contours, intelligent video surveillance

1. INTRODUCTION

Identification – the basis of every access control system – can be accomplished by

knowledge- (password), possession- (key) and biometric-based methods. Unfortunately,

passwords or keys involve an unavoidable weakness: they are only tenuously linked to

their owners. In contrast to this, biometric identification methods directly check the person to be identified, which is especially useful if we want to do this in a passive and remote

way. For this the ear provides promising identification features, but although ear prints

have already been used as evidence in criminal cases, up to now relatively little effort has

been invested in automating ear-based identification.

Ear-based identification is a relatively new method among biometric techniques.

Comparing face and ear identification, it is obvious that although people use faces in

everyday life to recognize their acquaintances, one would not normally attempt to

identify another person by their ears; in automated identification, however, ears can be

used more easily and reliably than faces. It is generally accepted that any given person’s

ear shape is unique (see Figure 1 below); therefore computer algorithms can identify the

differences, as they can recognize and extract the various distinctive features on ear

images in order to distinguish different ears, and thus identify different people.

Figure 1. Three image samples of ears from different persons

Alfred Iannarelli was the pioneer in using ear features to identify people, developing

his forensic method in 1949. He manually measured the distances between different parts

of the ear, and collected an ‘ear database’ containing more than ten thousand ear images

[1]. However, Iannarelli’s method was very limited, as it allowed for identification within

a population of not more than 16.7 million (4^12), since his metrics extracted only 24 bits.

Moreover, his measurement needed a precisely determined base point, which made his

method even harder to apply in automatic recognition. After Burge and Burger’s late ’90s

publication [2] on automating ear biometrics suggested the use of Voronoi diagrams, a

multitude of studies appeared, based on various approaches; one of the most recent

important studies in this field was by Hurley, Nixon and Carter, who used force-field

transformation and feature extraction [3].

As the model-based approach is widely used in face recognition, it was surprising to

realize that this approach had not yet been applied to ear-based identification, even

though the ear contains more distinctive and robust features. To exploit this robustness,

we chose to use some a priori knowledge about the ear’s geometry in the form of an ear

model, in order to establish a new approach in automatic ear-based identification for both

ear localization and the extraction of ear features.

2. REMOTE IDENTIFICATION OF HUMANS

In our longer-term target application of human identification and identity tracking [4], ear-based identification is integrated into a video surveillance system. Thus the

identification process starts with capturing frames from a surveillance camera,

presumably containing the image of the head of the person to be identified; this is

especially important in our case, as either the ear or the face is usually visible in an image

of a person’s head.

The draft architecture of the ear-based identification subsystem within an intelligent

video surveillance system is shown in Figure 2. The process starts with capturing the

frames from a surveillance camera, from which the moving shapes are segmented using

continuously synthesized background information. On the segmented shapes we can

either locate the target for identification (ear, face or other) directly, or we can first locate

only the head, the position of which can drive an active (e.g. pan-tilt-zoom) camera in

order to get close-up pictures from the desired identification target. In the first case we

need high-resolution cameras, and can also use the head localization information for

further localizations; in the second case we need an active camera, from which the close-

up frames can be used directly to locate ear, face or other targets of interest.

Figure 2. Remote human identification within the intelligent video surveillance system
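As an illustration of the first step of this pipeline, a minimal sketch of the moving-shape segmentation is given below. The paper does not specify the background model; the simple running-average background and the learning-rate and threshold parameters used here are assumptions for illustration only.

import numpy as np

def update_background(background, frame, alpha=0.02):
    # Continuously synthesize the background as a running average of the incoming
    # frames; alpha is a hypothetical learning rate, not taken from the paper.
    return (1.0 - alpha) * background + alpha * frame

def segment_moving_shapes(background, frame, threshold=25.0):
    # Pixels that differ markedly from the synthesized background are treated as
    # belonging to moving shapes; the threshold is likewise an assumed value.
    return np.abs(frame.astype(float) - background) > threshold

# Usage with grayscale frames:
# background = frames[0].astype(float)
# for frame in frames[1:]:
#     mask = segment_moving_shapes(background, frame)
#     background = update_background(background, frame)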

In this article we shall introduce our latest methods used in the last two modules of

this architecture, Identification target localization and Feature extraction implemented

for ears.

2.1. Ear-based remote identification

In the forthcoming sections we shall give an overview of the proposed ear localization

and feature extraction methods, both of which rely on a pre-defined model of the ear. The

localization is based on edge orientation pattern matching, while for feature extraction we

use the active contour (or “snake”) technique to determine the best-fitting model and

obtain the model deformation parameters as features of the ear.

Edge orientation pattern matching has been widely used in face localization recently,

as its efficiency is not inferior to other methods, while its speed is outstanding [6][7]. The

main idea is to prepare an edge orientation pattern template for the object to be localized,

and search for this template on the target image, on which the actual edges and their

orientations are also detected.

Active contour is a method that uses a deformable model built up from curve segments

to track a shape in motion. It was initially used by Kass, Witkin and Terzopoulos [5]; the

main idea is to define a model for which the inherent internal forces prevent its shape

being affected by arbitrary deformations, while the external forces attract the shape

segments, trying to move and deform it to best fit the edges on the underlying image.

Therefore the internal forces are defined as constraints on the parameters of the curve,

while the external forces are computed from the pixels of the detected edges on the

tracked image. The curve segments tend to their equilibrium position (in which internal

and external forces are equal), and thus the shape in motion is continuously tracked by

the deforming model. This is especially important, as we use surveillance camera

pictures, on which it is essential to track the ear with the model in real time after it has

been initially localized (see [8] for details of the active contour method).
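To make the force-balance idea of the active contour method concrete, the following minimal sketch iterates a closed chain of points toward the edges of an image. It is an illustration only, not the authors' implementation: the elastic internal force, the gradient-based external force and the weighting constants are assumptions.

import numpy as np

def snake_step(points, edge_map, alpha=0.2, beta=1.0, dt=0.5):
    # points:   (N, 2) array of (row, col) contour positions
    # edge_map: 2-D array of edge intensities (higher values mean stronger edges)
    # Internal force: pull each point toward the midpoint of its neighbours,
    # resisting arbitrary deformation of the chain.
    internal = 0.5 * (np.roll(points, 1, axis=0) + np.roll(points, -1, axis=0)) - points

    # External force: the gradient of the edge map sampled at the point positions,
    # attracting the contour toward strong edges of the underlying image.
    grad_r, grad_c = np.gradient(edge_map.astype(float))
    r = np.clip(points[:, 0].round().astype(int), 0, edge_map.shape[0] - 1)
    c = np.clip(points[:, 1].round().astype(int), 0, edge_map.shape[1] - 1)
    external = np.stack([grad_r[r, c], grad_c[r, c]], axis=1)

    return points + dt * (alpha * internal + beta * external)

The contour is iterated until the update becomes negligible, i.e. until the internal and external forces reach their equilibrium position.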

In the present article we shall describe in detail the methods used in the modules

shown on the ear-based identification architecture in Figure 3 below. The whole process

includes the main steps of ear localization and feature extraction, and ends with

identification. For ear localization, we prepare the image using an image pre-processing

step, after which the edges and their orientations are detected and the ear is localized on

the image. Localization ends with a geometric normalization of the detected ear. The

feature extraction process starts with the processing of the normalized edges, so that the

iteration can start on fitting the ear model to the edges.

The localization and feature extraction steps described above can be accomplished for

several images taken from subsequent frames of a video sequence, so that in the end we

can calculate the average of the feature vectors extracted from several independent

samples. Finally we can make the identification decision, determining whether or not the

feature vectors originate from the target person (ear).

Figure 3. Steps for ear localization (above) and feature extraction (below)

3. EAR LOCALIZATION

As shown above, the whole localization process consists of the following steps: (1)

image pre-processing, (2) edge detection, (3) localization itself and (4) geometric

normalization.

3.1. Image pre-processing

As with the majority of our image processing methods, the first step is to make a

grayscale image from a color one. For this we simply deduce an average of the RGB

color components for every pixel.
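A one-line sketch of this step, assuming an 8-bit RGB image stored as a height x width x 3 array:

import numpy as np

def to_grayscale(rgb):
    # Grayscale value as the plain average of the R, G and B components of every pixel.
    return rgb.astype(float).mean(axis=2).astype(np.uint8)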

3.2. Edge detection

For edge filtering, we detect the edges by simultaneously using eight linear filters (sized

5x5 pixels) each detecting one of eight cardinal directions. A sliding window moves

across the image, within which the eight filters analyze the neighborhood of every pixel

in the image. The characteristic orientation for a certain position is determined by

choosing the direction associated to the filter which provides the maximum result at that

position.


Figure 4. The linear filters used to detect the eight orientations

With the above filters, on one hand we get an orientation index value – which is the

number of the filter (0-7) for which the returned value is highest – and on the other hand

we get an edge intensity value, which is the actual result value of this particular filter

(normalized to values from 0 to 255). The orientation index values are used for

localization, as described in section 3.3 below, while the edge intensity values are used to

create the attractor set of the active contour model, as described in section 4.1. Both can

be visualized with a bitmap image (see Figure 5 below); the picture created from the edge

intensity values is called the edge intensity image, while the image which represents the

orientation indices is called the orientation index image.

Figure 5. The edge intensity image (left) and the orientation index image (right) of an ear
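The sketch below illustrates how the two images can be computed. The eight 5x5 kernels of Figure 4 are not reproduced here; the oriented step-edge kernels constructed below are placeholders with the same role, so the exact responses differ from those of the authors' filters.

import numpy as np
from scipy.ndimage import convolve

def directional_kernels():
    # Eight illustrative 5x5 oriented step-edge kernels, one per direction
    # (placeholders for the kernels shown in Figure 4).
    ys, xs = np.mgrid[-2:3, -2:3]
    kernels = []
    for k in range(8):
        theta = k * np.pi / 8.0
        nx, ny = np.cos(theta), np.sin(theta)      # edge normal of direction k
        proj = xs * nx + ys * ny
        kernels.append(np.sign(proj) * (np.abs(proj) > 0.5))
    return kernels

def edge_orientation_images(gray):
    # Orientation index image: the index (0-7) of the filter with the maximum
    # response; edge intensity image: that maximum response, normalized to 0-255.
    responses = np.stack([convolve(gray.astype(float), k) for k in directional_kernels()])
    orientation_index = responses.argmax(axis=0).astype(np.uint8)
    intensity = responses.max(axis=0)
    intensity = np.clip(255.0 * intensity / max(intensity.max(), 1e-9), 0, 255)
    return orientation_index, intensity.astype(np.uint8)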

3.3. Localization

For localization we initially used a neural network-based solution (see [9] for a

description of the method), but as it appeared to be too slow for a real-time application,

we decided to replace it with an edge orientation pattern matching method. An ear-like

object is found on the image if the fitting level value of the edge orientation pattern at a

certain position is higher than a pre-defined threshold.

To develop the pattern model that will be searched for during localization, first we

calculate the orientation indices (described above in section 3.2) representing the

dominant edge orientations for every pixel on several images on which the ear is

localized and normalized manually. To get a practicable pattern model, we reduce the

size of the training orientation images by a factor of four (by applying a filter shown in

Figure 6); we simply calculate the weighted average of the neighboring orientations for

every second row and column position. Thus we not only get a smaller pattern model, but

the distribution of same-orientation areas within the common ear pattern is also smoothed. As a result, the sensitivity of the pattern-matching localization to the exact shape of the ear is tempered.

1, 2, 2, 1
2, 6, 6, 2
2, 6, 6, 2
1, 2, 2, 1

Figure 6. The weights within the 4x4 pixel filter used in deriving the orientation pattern
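A small sketch of this reduction, using the weights of Figure 6 and sampling at every second row and column position. Averaging the orientation indices as plain numbers is a simplification, since their circular nature is ignored here.

import numpy as np

# The 4x4 weights of Figure 6, normalized so that they sum to one.
WEIGHTS = np.array([[1, 2, 2, 1],
                    [2, 6, 6, 2],
                    [2, 6, 6, 2],
                    [1, 2, 2, 1]], dtype=float)
WEIGHTS /= WEIGHTS.sum()

def reduce_orientation_image(orientation):
    # Weighted average of 4x4 neighbourhoods taken at every second row and column.
    h, w = orientation.shape
    out = np.empty(((h - 2) // 2, (w - 2) // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = orientation[2 * i:2 * i + 4, 2 * j:2 * j + 4].astype(float)
            out[i, j] = (block * WEIGHTS).sum()
    return out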

To create a model for the edge orientation pattern, we simply averaged the reduced

orientation images. After this procedure we got the following ear-like image, which uses

different colors to represent the indices of the characteristic orientations for certain areas:

Figure 7. Our average edge orientation pattern model for ears

To localize the ear, we scan the edge orientation representation of the input image

using this pattern as a sliding window, and select the position which gives the best fit for

the pattern, related to the edge orientations found on the image. The level of fit is

calculated by iterating through the pattern, and adding a positive reward value if the

orientation on the pattern model matches that on the underlying image at a certain

position, while subtracting 1 for each position where it does not match.
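A sketch of this scan is given below, assuming that the orientation index image has already been reduced to the same resolution as the pattern. The paper does not state the magnitude of the positive reward, so the value used here is an assumption.

import numpy as np

def fit_level(window, pattern, reward=2):
    # +reward for every position where the pattern orientation matches the
    # underlying image, -1 for every position where it does not.
    matches = (window == pattern)
    return reward * matches.sum() - (~matches).sum()

def localize_ear(orientation_index, pattern, threshold):
    # Slide the pattern over the orientation index image and return the best
    # position, or None if no fit level exceeds the pre-defined threshold.
    ph, pw = pattern.shape
    ih, iw = orientation_index.shape
    best_score, best_pos = -np.inf, None
    for r in range(ih - ph + 1):
        for c in range(iw - pw + 1):
            score = fit_level(orientation_index[r:r + ph, c:c + pw], pattern)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos if best_score >= threshold else None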

Unfortunately, due to the large variance in edge orientations on the ear, some

localization errors still occurred using the edge orientation pattern introduced above, so

we decided to develop and use several patterns. Instead of averaging the edge orientations

to get a single pattern, we formed clusters in the space occupied by the edge orientation

images, thus gaining several characteristic patterns. We tested our edge orientation-based

localization algorithm with patterns formed in two to six clusters, and found the optimal

cluster (pattern) number to be three. The most characteristic patterns resulting from this

are shown in Figure 8 below:

Figure 8. The three derived edge orientation patterns used for ear localization

Localization using three patterns means that all patterns are searched for in parallel,

and the position with the best match from any of the three is accepted. The difference in

effectiveness between one-pattern localization and three-pattern localization can be seen

in Figure 9 at the end of the next section.

3.4. Geometric normalization

The best-fitting pattern (one of the three introduced above) and its localized position on

the input image determine a linear (rotation-translation-scaling) transformation, which we

can use to map (normalize) the ear image into a relatively good initial position for the

active contour iterations. All three patterns have their own pre-defined characteristic

linear transformations, so the transformation is simply a pre-defined constant property of

the best-fitting pattern.
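A sketch of applying such a pre-defined rotation-translation-scaling transformation is given below; the parameters of the transformation are assumed to be stored with each pattern, and nearest-neighbour sampling is used for simplicity.

import numpy as np

def similarity_matrix(scale, angle_rad, tx, ty):
    # Homogeneous 3x3 rotation-translation-scaling matrix in (x, y) coordinates.
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

def normalize_ear(image, matrix, out_shape):
    # Map the localized ear into a pre-defined rectangle by inverse mapping:
    # for every output pixel, look up the corresponding input pixel.
    inv = np.linalg.inv(matrix)
    rows, cols = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    src = inv @ np.stack([cols.ravel(), rows.ravel(), np.ones(rows.size)])
    x = np.clip(src[0].round().astype(int), 0, image.shape[1] - 1)
    y = np.clip(src[1].round().astype(int), 0, image.shape[0] - 1)
    return image[y, x].reshape(out_shape)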

After applying this transformation based on the identified edge orientation pattern, we

get geometrically normalized ear images that fit into a pre-defined rectangle and are thus

comparable, which means that feature extraction can be carried out. At the same time this

normalization can show us the accuracy of the localization; for this we projected the

normalized images onto each other, and compared the sharpness of the derived averaged

edges for different localization methods. The result is shown in Figure 9 below:

Figure 9. The effectiveness of one- and three-pattern normalization

The right-hand image, which is generated by using three-pattern localization, is

sharper than the left-hand one; this means that it has lower entropy, so by using three

patterns instead of one we can achieve better localization, and consequently better

normalization of the ear. This is an important aspect of active contour-based feature

extraction.

4. FEATURE EXTRACTION

4.1. Edge processing

The first step of feature extraction is to create threads of pixels on the image which will

be the attractors of the active contour model. For this, first we create the level set of the

edge-filtered and normalized image: this is simply the set of pixels having values greater

than a pre-defined intensity threshold, where intensity in this case is defined on the edge-filtered image, i.e. the “strength” of the edge at a given position. An example level

set is presented with darker raster graphics in Figure 10 below (left).

Figure 10. The level set of an ear image (left) and its weakening after several morphologic iterations (right)

To weaken the level set we use morphologic transformations based, on the one hand, on the intensity values (edge strength) of the pixels and, on the other hand, on the number of side (horizontal and vertical) neighbors of each pixel in the set; the corner neighbors are also taken into account in some cases. Each pixel can have 0-4 side neighbors, but the aim

of all the applied rules is to “clean” the set to reach a final state in which pixel threads are

built up from pixels having only one or two neighbors.

The initial level set has a threshold of 16 (intensities can have values from 0 to 255),

and weakening is accomplished through several iterations using a multitude of cellular

automata rules. With each iteration the first step eliminates some “weak” bordering

pixels: we simply eradicate a pixel which has two side neighbors if the corner neighbor

between these side neighbors has higher intensity than the pixel in question; we also

eradicate a pixel which has three side neighbors and two corner neighbors (between the

side neighbors) of higher intensity. The second step of each iteration erases those inner

pixels of the level set which are below a threshold value and have four side neighbors.

This threshold value depends on the number of the current iteration, and the thresholds

are 32, 64 and 128 respectively.

After three such iterations we erase the rest of the inner pixels (which have four

neighbors), and get a weakened level set, as seen in Figure 10 (right). At the very end of

edge processing we simply eradicate the weakest threads originating from the remaining

points with three neighbors (one thread from the three which has minimal intensity within

a distance of four pixels from the meeting point), to end up with threads in which all the

pixels have one or two neighbors. These independent threads will generate the attracting

forces for the active contour iterations.
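A simplified sketch of this edge processing follows. It implements the initial thresholding and the erasure of weak inner pixels with the thresholds quoted above; the border-cleaning rules involving corner neighbours and the final thread pruning are omitted, so this is only an approximation of the full rule set.

import numpy as np

def level_set(edge_intensity, threshold=16):
    # Initial level set: pixels of the edge-filtered image above the threshold.
    return edge_intensity >= threshold

def side_neighbour_count(mask):
    # Number of horizontal and vertical neighbours (0-4) each pixel has in the set.
    padded = np.pad(mask.astype(int), 1)
    return (padded[:-2, 1:-1] + padded[2:, 1:-1] +
            padded[1:-1, :-2] + padded[1:-1, 2:])

def weaken(edge_intensity, mask, thresholds=(32, 64, 128)):
    # In each iteration erase inner pixels (four side neighbours) whose intensity
    # is below that iteration's threshold; finally erase all remaining inner pixels.
    mask = mask.copy()
    for t in thresholds:
        inner = side_neighbour_count(mask) == 4
        mask &= ~(inner & (edge_intensity < t))
    mask &= ~(side_neighbour_count(mask) == 4)
    return mask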

4.2. Iterating the outer contour

After the image has been normalized and the edges have been detected and clarified, we

iterate the lines of a standard ear model to them; however, due to the inaccuracy of

localization, it has proved more effective to first iterate the outer contour of the ear model

approximately to the bordering edge of the ear, and to execute precise matching of the

inner contour curves afterwards. We can think of this step in feature extraction as a more

precise localization of the ear, on the basis of which the rest of the ear image is further

normalized.

To determine the forces used for the iteration of the outer ear contour, we simply use

the pixel threads determined by the edge detection method described above, while also

taking into account the tangential properties of these threads. This means that pixels in a

thread representing an edge exert a greater attraction force on a curve segment of the

model if that thread is parallel to the segment. In fact the force is proportional to the

absolute value of the scalar product of the two tangents, that of the model’s closest

section and that of the pixel thread: $F_{ext} = (\vec{t}_{ac} \cdot \vec{t}_{pt})^2$, where $\vec{t}_{ac}$ is the normalized tangent vector of the active contour at an arbitrary point, and $\vec{t}_{pt}$ is the tangent vector of a pixel thread in the vicinity of that point.
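A small sketch of this tangent-weighted attraction, with tangents approximated by finite differences along the contour and along the pixel thread:

import numpy as np

def unit_tangents(points):
    # Approximate unit tangent vectors along a polyline by central differences.
    diffs = np.gradient(points.astype(float), axis=0)
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    return diffs / np.maximum(norms, 1e-9)

def attraction_weight(t_ac, t_pt):
    # F_ext = (t_ac . t_pt)^2: a thread pixel attracts a contour segment more
    # strongly when the thread runs parallel to that segment.
    return float(np.dot(t_ac, t_pt)) ** 2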

Figure 11. Pixel threads with the iteration of the outer edge of the ear (left) and the normalized ear edges projected onto each other (right)

When the outer contour of the ear model reaches its equilibrium position (see the

darker smooth line in Figure 11, left), this border defines a space in which we carry out

the fine-tuned matching of the inner model. This is actually normalization to the initial

model, as instead of the original coordinates on the ear image, we hereafter use

transformed points, for which the transformation is defined by the position and shape of

the determined outer edge.

To examine the behavior of the thus normalized ear edges, we projected them onto

each other again, forming an average normalized ear (Figure 11, right). It is plainly

observable that inside the clear and dark ear border we have certain remarkably clear

areas which – though characteristic edges on every ear – vary enough to be used as

features.

4.3. Fitting the ear model

As the dark areas on the average normalized ear image shown in Figure 11 (right)

suggest, the border contour and the three loose inner edges can represent the curves of

our compound ear model shown below in Figure 12 (left). The shape and the interrelation

of these edges are defined in the form of the internal forces of the model.

Figure 12. Our common ear model (left) and the iteration of the active contours to the actual ear image (right)

To eliminate the external forces originating from pixel threads belonging to other

curves, we can classify the detected attracting pixel threads according to the nearest

active contour curve on the initial model. For this we define pixel zones on the

normalized ear image, and the pixel threads that will attract the active contour are

classified according to the zone that the majority of their pixels fall in; thus every curve

has its corresponding (attracting) pixel threads.
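A sketch of this classification is given below, assuming the pixel zones are represented as an integer label image (zone_map) that assigns every position of the normalized ear image to the nearest model curve.

import numpy as np

def classify_thread(thread_pixels, zone_map):
    # Assign a pixel thread to the curve whose zone contains the majority of its pixels.
    zones = zone_map[thread_pixels[:, 0], thread_pixels[:, 1]]
    return np.bincount(zones).argmax()

def group_threads_by_curve(threads, zone_map, n_curves=4):
    # For every curve of the ear model, collect its corresponding (attracting) threads.
    groups = {curve: [] for curve in range(n_curves)}
    for thread in threads:
        groups[classify_thread(thread, zone_map)].append(thread)
    return groups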

When using the active contour method, in every step of the iteration of the model we

deal both with internal and external forces. In the first step – dealing with external forces

– we select random pixel points uniformly distributed on an active contour curve, and

calculate the external forces in the vicinity of those positions. To calculate the forces

attracting a curve we calculate only the attraction of corresponding pixel threads;

moreover, similarly to the handling of the outer contour, we take into account the scalar

product of the tangents of the pixel threads and the active contour sections.

As a second step, the internal forces that tend to preserve the original shape of the

active contour are designed to be inversely proportional to the difference between the

initial and the actual positions of the segmentation points (the points where the curve

segments of the model are joined). The shape-preserving properties of the model are

expressed as rigidity coefficients: we define different rigidity factors for the four curves

of the model to prevent deformation of their shape, and other coefficients to prevent their

mutual displacement. The rigidity coefficients used for the shapes of the separate curves

are higher than those preserving their distance from each other, which means that the

original shapes of the curves are likely to be preserved, while their relative positions can

vary more.

Figure 12 (right) shows an active contour model in its final state, in which the internal

and external forces are in the equilibrium position; the model fits the underlying

normalized image, while the original shapes of the curves are roughly preserved.

4.4. Feature extraction

At this point we have an active contour that best fits the underlying ear image. The last

step is to analyze this model and collect feature values from it, in order to form a feature

vector which could be used for ear-based identification. Basically, we define two sets of

features.

The first set of features is derived from the distortion of the model, expressed as the

difference between the original and the final (reposed) state of the active contour on an

ear image. This difference is on the one hand derived by determining the distances of the

reposed curves from the original curves, which can be done by measuring the distance

between each segmentation point on the original model curve and the point where the

perpendicular of a tangent through this point cuts the reposed curve (denoted by the

thicker line in Figure 13, left, A). As we have thirty-one segmentation points in total on

the four curves of the model, this method provides us with thirty-one feature values. On

the other hand, the distance of the reposed curves’ segmentation points from the

underlying pixel threads can be measured similarly, providing an additional thirty-one

parameters. Finally, we also measure the thickness of the edges under the segments of the

contour, so this feature set has 31 x 3 = 93 feature values.
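A simplified sketch of the first two measurements follows. Instead of intersecting the perpendicular of the tangent with the reposed curve, the distance to the closest point of a densely sampled curve is used here as an approximation, and the edge-thickness measurement is omitted.

import numpy as np

def nearest_distance(point, polyline):
    # Distance from a point to the closest vertex of a densely sampled polyline.
    return np.linalg.norm(np.asarray(polyline, dtype=float) - point, axis=1).min()

def distortion_features(original_pts, reposed_pts, reposed_curve, thread_pixels):
    # Distance of every original segmentation point to the reposed curve, and
    # distance of every reposed segmentation point to the underlying pixel thread.
    d_model = [nearest_distance(p, reposed_curve) for p in original_pts]
    d_edges = [nearest_distance(p, thread_pixels) for p in reposed_pts]
    return np.array(d_model + d_edges)   # 31 + 31 values over the whole model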

The rest of the features are derived using three appropriately selected axes of

measurement and twenty-one chosen points, selected on the model so that the tangents of

the curves at these points are perpendicular to one of the axes. Thus every point is

associated with one axis, and we have six, seven and eight points associated with the

three axes respectively. For a reposed active contour we determine the projection of these

points on the associated axis (P1), and the feature is formed from the distances (marked f1

on Figure 13, left, B) measured between these projection points and the intersection of

the three axes. In this way we obtain an additional twenty-one feature values, which –

due to the appropriate selection of the axes of measurement – are fairly independent of

the angle from which the ear image was taken. Merging the two feature sets, we gain a

114-dimensional feature vector, elements of which are called static features. These are

expanded further by the so-called dynamic features, as explained below.


Figure 13. The distortion (A) and direction (B) derived features and the three axes of measurement, with the points to be projected (right)

In the case of sample collection using a video surveillance system, we can use several

ear sample images to make a single identification (acceptance or rejection) decision. The

sample images are taken from consecutive video frames showing the same person,

enabling us to calculate many more reliable features by taking into account the values

measured on all available samples. For this we on one hand average all of the static

feature values described above to get a single feature vector, and on the other hand we

extend this vector with further values by carrying out a regression analysis of the features

extracted using the axes of measurement. To do this, we use a special square-error-

minimizing regression formula, which not only means that we can determine the

parameters of the linear function which best fits the measured values, but also that the

values measured on the samples are ordered in the best way to form these lines. In Figure

14 below we can see the comparison of two such sets of regression lines belonging to the

six-point axis for two different people. Feature values taken from different samples are

arranged horizontally, while the six feature values measured on each image are denoted

by different colors. In this best-fitting arrangement, the regression lines determine six

more feature values, as we simply store the steepness of these lines (the parameter a in y = ax + b); similarly, we have seven more values for the second and eight for the third

axis of measurement, ending up with twenty-one dynamic feature values.
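A simplified sketch of the dynamic features for one axis of measurement is given below. It fits y = ax + b across the samples for every measurement point and keeps the slopes; the re-ordering of the samples that the authors perform as part of the square-error minimization is omitted here.

import numpy as np

def dynamic_features(axis_values):
    # axis_values: (n_samples, n_points) array holding, for one axis of measurement,
    # the projected distances f measured on every sample image.
    n_samples, n_points = axis_values.shape
    x = np.arange(n_samples, dtype=float)
    slopes = [np.polyfit(x, axis_values[:, j], 1)[0] for j in range(n_points)]
    return np.array(slopes)   # 6, 7 or 8 slope values, depending on the axis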

So, by having several image samples from the same observed person, the 114 static

feature values are simply averaged, and this average is augmented by the dynamic

features, which are derived from the above described regression lines. With three axes of

measurement having 6+7+8 measurement points, we finally get a 114+21=135-

dimensional averaged feature vector for every video sequence showing the same person

over several frames. This completes our feature extraction method.

Figure 14. The arranged feature values of the axis with six points from two different samples

Because a person passing in front of the camera is usually visible from different

angles, the dynamic features are very distinctive. This is not surprising, as in a way they

carry some information about the three-dimensional shape of the ear.

4.5. Identification

The identification decision during the comparison of ear samples is based on the

difference between the two feature vectors, defined as follows. For the static features we

simply calculate the scalar difference for every one of the 114 static coordinates in the

averaged feature vectors, while for dynamic features (the last twenty-one coordinates),

we calculate three distance parameters – one for each axis of measurement – by taking

into account the distance of all regression lines for that axis (six, seven or eight). The

distance between the two line sets derived for an axis of measurement is calculated by

fitting one line set to the other with a linear transformation, so that the absolute integral of their difference within a pre-defined interval is minimal, and in this transformed

position we calculate the difference in their (transformed) steepness. This means that

while we have 135-dimensional feature vectors (114 static and 21 dynamic feature

values), the difference vector has only 114+3=117 dimensions.

Finally, to decide acceptance or rejection (whether the samples are from the target

person/ear or not) we project this difference vector onto an optimal direction, to be able

to define a simple scalar threshold for acceptance or rejection. This optimal direction can

be determined in advance with a training process. We first tried to train a neural

perceptron to calculate this projection direction; but in the end, for efficiency reasons, we

decided to use a simple regression formula, defining the optimal direction to be the one

between the centers of gravity of two difference vector sets, one of which is calculated

from the averaged feature vectors belonging to the same person (ear) and the other from

the difference between features from different persons.

The calculation method using the regression formula for this optimal direction is the

following: let s and d be the sets of difference vectors originating from the samples

belonging to the same and different persons respectively. Now the optimal direction v

can be found where the expression $\sum_{s_i \in s} (s_i \cdot v)^2 - \sum_{d_k \in d} (d_k \cdot v)^2$ has its minimum, as this is the direction in which we can get the most distinguishable gap between the sets formed from the vectors $s_i$ and $d_k$.

The scalar associated to this vector is simply its intensity (the square root of the sum of the squares of its coordinates), and for this value we can simply define a threshold value determined

by a pre-analysis of the test database. If the intensity of the projected difference vector is

lower than this threshold, then the two ear samples are assumed to originate from the

same ear; if the distance value is above the threshold, the samples are from different ears.
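A minimal sketch of this decision step is shown below, using the simpler variant described above in which the optimal direction is taken between the centers of gravity of the two difference-vector sets; the threshold is assumed to be given from the pre-analysis of the test database.

import numpy as np

def optimal_direction(same_diffs, different_diffs):
    # Direction between the centers of gravity of the difference vectors computed
    # from same-person samples and from different-person samples.
    v = np.mean(different_diffs, axis=0) - np.mean(same_diffs, axis=0)
    return v / np.linalg.norm(v)

def accept(difference_vector, direction, threshold):
    # Project the difference vector onto the optimal direction; the samples are taken
    # to come from the same ear if the projection magnitude is below the threshold.
    return abs(float(np.dot(difference_vector, direction))) < threshold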

5. RESULTS

The proposed model-based method for human ear identification was tested on images

chosen from video frames taken of twenty-eight people. Both localization and feature

extraction were tested: for each person we had five sequences, each of around fifty

frames, on half of which an ear was detected. These ear images had very similar

resolutions, approximately 80x120 pixels. The viewing angles for ears varied by as much

as fifty degrees – a variation still tolerated by the edge orientation patterns.

The video sequences were initially assigned to subjects, enabling comparison testing

of acceptance and rejection. To carry out tests for false acceptance rate (FAR) and false

rejection rate (FRR), we first calculated the feature vectors for every frame containing an

ear, and formed the average feature vector for every video sequence. We gained 140

(28x5) averaged feature vectors by processing a total of 3,500 frames, or ear image

samples. Difference vectors were formed for each feature vector pair, and the intensity of

the projection of these difference vectors was compared to the pre-defined threshold, to

decide on acceptance or rejection.

Two images used in the FAR and FRR comparison tests are shown in Figure 15

below, in which the vertical axis shows the error rate as a percentage, and the horizontal

axis shows the threshold for the feature vector distances used in determining acceptance.

The equal error rate (EER, the error rate at the threshold value, where FAR equals FRR)

was 5.6%.


Figure 15. Two images containing ear samples used in the ear-based identification,

and the error graph showing the FAR and FRR error rates

6. CONCLUSIONS

As compared to our previous results (see Results and Conclusions in [10]), we can see

that by augmenting the static features with the measurement of the thickness of the edges,

and by refining the dynamic features in addition to simply averaging the values measured

on subsequent video frames, we were able to improve our method further, decreasing the

equal error rate (EER) to 5.6%.

The strength of the dynamic features originates from the fact that, by integrating values measured on a multitude of ear image samples taken from different angles, they in a way carry information about the spatial shape of the ear without actually building or handling an exact three-dimensional model.

Moreover, by using a localization method based on edge orientation patterns we obtained an identification process which can run in real time on video surveillance footage of 20 FPS. This is especially important, as the main application area for ear-based

identification is our intelligent video surveillance system, Identrace [4].

7. ACKNOWLEDGEMENTS

The research is now ongoing as part of the Integrated Biometric Identification System

project (identification number 2/030/2004), with the support of the National Research and

Development Programme (NKFP) of the National Office for Research and Technology

(NKTH), Hungary.

8. REFERENCES

[1] Alfred Iannarelli, Ear Identification, Forensic Identification Series, Paramont Publishing

Company, Fremont, California, 1989;

[2] Mark Burge, Wilhelm Burger, Ear Biometrics, 1998;

http://www.computing.armstrong.edu/FacNStaff/burge/pdf/burge-burger-us.pdf

[3] D. J. Hurley, M. S. Nixon, J. N. Carter, Force field feature extraction for ear biometrics,

Computer Vision and Image Understanding, pp. 491-512, 2005;

http://eprints.ecs.soton.ac.uk/10242/01/hurley_cviu.pdf

[4] Identrace – Beyond Surveillance, Intelligent Video Surveillance System;

http://www.identrace.com/

[5] M. Kass, A. Witkin, D. Terzopoulos, Snakes: Active contour models, International Journal of

Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988; http://mrl.nyu.edu/~dt/papers/ijcv88/ijcv88.pdf

[6] Li Bai, LinLin Shen, Face Detection by Orientation Map Matching, Proceedings of the

International Conference on Computational Intelligence for Modeling Control and Automation,

Austria, Feb., 2003, pp.363-369; http://www.cs.nott.ac.uk/~bai/papers/CIMCA_2003.pdf

[7] Bernhard Fröba, Christian Küblbeck, Robust Face Detection at Video Frame Rate Based on

Edge Orientation Features, Proceedings of the Fifth IEEE International Conference on Automatic

Face and Gesture Recognition, Washington D.C, May 2002, pp 342;

http://www.embassi.de/publi/veroeffent/Froeba-Kueblbeck.pdf

[8] A. Blake and M. Isard, Active Contours, Springer, 1998;

http://www.robots.ox.ac.uk/~contours/

[9] László Máté, Localizing Feature Points on Ear Images, HACIPPR, Veszprém, ISBN 3-

85403-192-0: pp 57-63, 2005;

[10] Ernő Jeges, László Máté: Model-based human ear identification, World Automation

Congress, 5th International Forum on Multimedia and Image Processing (IFMIP), 2006;