
SBA Controle & Automação Vol.11 no. 03 / Set., Out., Nov, Dezembro de 2000 187

ATTENTIONAL MANAGEMENT FOR MULTIPLE TARGET TRACKING BY A BINOCULAR VISION HEAD

Fábio de Freitas Caetano
Jacques Waldmann

Instituto Tecnológico de Aeronáutica
Departamento de Sistemas e Controle - Divisão de Engenharia Eletrônica
CTA-ITA-IEES - 12228-900 - São José dos Campos – SP

[email protected]

Abstract: The development of an attention strategy for multiple target tracking and its integration with a binocular vision head is presented. Monocular motion-based segmentation yields image regions that are further clustered into targets. The emulation of a fovea significantly reduces the computational workload of target segmentation. Attentional management is based on assigning an interest value to each target. Its evaluation weighs binocular disparity, the number of detected moving pixels in a target, its pixel density, velocity and duration along the image sequence. The vision head performs saccades, ballistic motions and smooth pursuit around its tilt and vergence axes with characteristics similar to those of the anthropomorphic visual apparatus. System performance is evaluated on a 166 MHz PC host with inexpensive off-the-shelf hardware in two situations. In the first case, small targets translate independently over a textured background. In the second, a moving human subject yields multiple targets undergoing rotation, non-rigid motion and scale changes. A processing rate of about 0.8 stereo pairs per second is attained. The first case demonstrates the tracking and image stabilization capabilities for small translating targets. The second case reveals the impact of hardware limitations on the system's sensitivity to distortions in gray-level patterns induced by interframe rotation and by changes in illumination and scale factor as the tracked human subject moves around the scene. As a consequence, undesired attentional reorientations occur. However, the system succeeds in keeping the subject within the field of view under such conditions, though not always stabilized within the image region of fixation.

Keywords: Active vision, visual attention, binocular tracking, fovea-aided multiple target segmentation, robotic vision head, real-time control, autonomous robotic systems.

1 INTRODUCTION

Since time immemorial, the sense of vision has supported human beings in their search for food and shelter, in evading predators, and in overall survival. Every day, humans engage countless times in complex visual tasks such as detecting and tracking objects that somehow draw their attention. Detection is here considered as the confirmation that an object is present in the image, whereas tracking concerns maintaining a given object within the field of view. The latter is often accomplished by controlling eye position and velocity. The connection between tracking and attention, which is a critical assumption in this work, is made explicit here. It is hypothesized that the visual tracking behavior in humans is driven by a sort of attentional management. It seems reasonable to claim that some sort of decision is required as to whether an object raises the system's attention for tracking purposes, and whether another object entering its field of view becomes more interesting, thus producing an attentional reorientation.

The literature describes robotic vision systems that reproduce some of the features found in biological vision systems that successfully evolved in nature. The motivation behind such efforts can be traced to the need for automated industrial and information processes based on visual feedback. An effective visual tracking system can be useful in a wide range of applications such as automatic surveillance (Batista et allii, 1998), traffic monitoring (Dagless et allii, 1993), vision-aided autonomous vehicles (Balkenius and Kopp, 1996; Huber and Kortenkamp, 1995), multiple target detection, selection and tracking by an aircraft or missile (Bar-Shalom and Li, 1993), and novel man-machine interfaces such as guidance support for the visually impaired (Crisman et allii, 1998; Molton et allii, 1998). Research efforts aligned with the work presented here exist in the areas of visual tracking (Andersen, 1996; Murray and Basu, 1994; Uhlin, 1996), visual attention (Bandera et allii, 1996; Culhane and Tsotsos, 1992) and the development of anthropomorphically inspired robotic heads for active vision (Andersen, 1996; Weiman and Vincze, 1996).

1.1 Visual Tracking

Model-driven tracking is based on high-level information to describe objects in the image (Andersen, 1996). Object features are sought which are robust to scale changes, translation and rotation of the object's projection on the image. Image analysis often resorts to object-centered representations. However, unknown objects can be incorrectly processed, with detrimental effects on system performance. Therefore, model-driven tracking may not be suitable in an unstructured environment, as the completeness of the model database and the robustness of the representation scheme become critical. Moreover, it seems reasonable to assume that the model-driven approach imposes a heavy computational workload when dealing with the day-to-day situations expected in this work.

Article submitted on 02/02/2000. First revision on 31/07/2000. Accepted under the recommendation of Consulting Editor Prof. Dr. Jacques Szczupak.


Hence, the data-driven approach is selected to circumvent the complexity inherent in the model matching problem when dealing with unstructured environments, and to keep the computational burden bearable for inexpensive PC-based hardware. The latter approach utilizes pixel-based information such as brightness, motion, texture and contrast, and sets of image points result as possible targets (Culhane and Tsotsos, 1992).

Binocular tracking requires control of both fields of view to project the image of any object raising the system's interest onto the region of fixation of both cameras. The human eye carries out this task by means of saccades and smooth pursuit (Araújo et allii, 1996; Batista et allii, 1997; Cowie and Taylor, 1997; Rotstein and Rivlin, 1996; Uhlin, 1996). Saccades are fast and usually have a large amplitude. They are associated with the allocation of attention to another region of 3D space. Rapid, minute corrections of the eyes' positions during the fixation of a static object occur and are called microsaccades. The high speed of a saccade precludes the processing of visual information due to motion blur. Smooth pursuit consists of a slow eye motion during tracking. It aims at maintaining the viewed object's projection on the fovea, where the retinal resolution peaks. A mismatch between sensor velocity and object retinal velocity causes the object's projection to slip away from the fovea. This is usually remedied by means of a microsaccade, and smooth pursuit then resumes. Whenever attention is shifted to a different portion of the scene, a saccade to that region occurs, followed by smooth pursuit and microsaccades. In the present work, the binocular vision head is controlled according to the following guidelines:

1 - Ballistic motions. These correspond to what is described as saccades in the literature. They employ high-speed movements to control the camera position and are used to coarsely aim at an object that has just attracted the system's attention. The rapid motion prevents the processing of visual information. The camera must halt to resume image processing; otherwise, overshoot and oscillations result in blurred images.

2 - Saccades. Here, differently from the literature, these are corrections to the camera's angular velocity. They are engaged whenever a velocity mismatch between the camera and the object's projection occurs, causing significant slip error relative to the region of fixation. The camera velocity is then updated to reduce this error. Image processing continues during this type of motion.

3 - Smooth pursuit occurs when the slip error is within acceptable bounds and the tracked object projects on the region of fixation, thus obviating any corrections in camera velocity. Ideally, image processing benefits from sharper images as the target projection is stabilized on the region of fixation.
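The three guidelines above amount to a mode-selection rule driven by the slip error and by whether attention has just shifted. A minimal sketch of such a rule follows; the function name, the slip tolerance and the boolean flag are illustrative assumptions, not values from the paper.

```python
def select_motion_mode(slip_error_px, new_target, slip_tolerance=5.0):
    """Return which of the three motion types should be engaged.

    slip_error_px: distance (pixels) between the target projection and
                   the center of the region of fixation.
    new_target:    True when attention has just shifted to a new target.
    """
    if new_target:
        # Guideline 1: coarse, high-speed repositioning; image
        # processing is suspended until the camera halts.
        return "ballistic"
    if slip_error_px > slip_tolerance:
        # Guideline 2: velocity correction; image processing continues.
        return "saccade"
    # Guideline 3: target stabilized on the region of fixation.
    return "smooth_pursuit"
```

In this sketch, a ballistic motion always takes priority, since a freshly selected target must first be brought near the region of fixation before slip-error feedback is meaningful.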

1.2 Visual Attention

Early visual processing in humans is often characterized as having two functionally distinct modes. The first, preattentive, processes information in a spatially parallel manner that circumvents the need for attentional resources; the second involves the allocation of attentional resources to specific locations or objects for more complex analysis. The latter mode, also known as the "where to look next" problem, is a key issue when dealing with various objects that compete for the system's attention. Research concerning the allocation of attentional resources in human beings has yielded evidence for another functional dichotomy: attentional orienting can be voluntary, controlled by strategic goals (endogenous attentional control), or involuntary, driven by particular stimulus events (exogenous attentional control), as in Folk et allii (1992). They claimed the existence of psychophysical evidence supporting the "contingent involuntary orienting hypothesis", according to which involuntary shifts of attention to a stimulus event depend on whether the event shares a feature property that is critical to the performance of the desired task. Involuntary shifts of attention were systematically dependent on the relationship between the stimulus properties of the distractor cue and the properties required to locate and process the target. Inspired by such observations, it is proposed here that the design of a machine capable of visually tracking multiple targets should result in a system behavior consistent with the desired goal while reducing undesired attentional reorientations. On the other hand, the management of awareness of the surroundings should provide flexibility to consider changing environmental conditions and their impact on system goals, in order to adequately modify the attentional control settings according to the need for adaptation and autonomous behavior.

Real-time visual tracking, either in biological systems or in their machine counterparts, requires dedicated hardware with a high processing capacity. Even with the evolution of dedicated hardware to meet the needs of parallel computing, along with improved actuators and controllers for machine vision systems, one is often limited by the available computational resources. Inspired by nature's solution to this question, an attentional management strategy should attempt to make efficient use of limited system resources. Spatial compression of visual data occurs at the retina, with a varying resolution that peaks at the fovea. The eyes scan the surroundings and position the projection of an area of interest onto the fovea to acquire high-resolution data. Peripheral vision also plays a role in shifting the system's attention to other interesting objects within the field of view. Overall, the computational burden is kept bearable.

It is assumed that tracking is neither activated nor terminated without a purpose. Such a purpose induces the attention of the visual system to focus on a specific region or to shift focus to another one. Interest in a given object may cease as the system's goal is attained, such as the completion of its recognition or the estimation of its trajectory parameters, for instance. A quantitative determination of the interest raised by multiple moving objects in a scene is considered here in order to establish priorities for tracking. An attention function based on image features is proposed to evaluate which part of the image contents should raise the system's interest. The selection of task-relevant, reliable image features, called here attentional features, and of an adequate structure for the attention function is critical, since they directly influence the system behavior (Andersen, 1996; Araújo et allii, 1996a, 1996b; Batista et allii, 1997; Culhane and Tsotsos, 1992; Hager and Belhumeur, 1996; Huber and Kortenkamp, 1995; Murray and Basu, 1994; Shi and Tomasi, 1994; Uhlin, 1996). Among the features that could raise the attention of the system are object brightness, shape, velocity, distance to the observer and the duration of its occurrence within the field of view. In yet another direction, involving the application of estimation theory, Kalman filtering and α-β-γ filters have been utilized to predict the positions of simultaneous objects in consecutive images, but without addressing the aspect of establishing priorities for tracking (Andersen, 1996; Bar-Shalom and Li, 1993).
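To make the idea of an attention function concrete, the sketch below scores each target as a weighted sum of the attentional features named in the abstract (disparity, number of moving pixels, density, velocity, duration). The linear structure, the weights and all numeric values are assumptions for illustration only; the paper's actual function and weight selection are discussed later.

```python
def interest_value(features, weights):
    """Weighted-sum attention function over named attentional features."""
    return sum(weights[name] * features[name] for name in weights)

# Two hypothetical targets described by their attentional features.
targets = [
    {"disparity": 0.8, "n_pixels": 120, "density": 0.7, "velocity": 3.2, "duration": 5},
    {"disparity": 0.2, "n_pixels": 300, "density": 0.4, "velocity": 0.5, "duration": 12},
]
# Illustrative weights; in the paper these are design parameters.
w = {"disparity": 1.0, "n_pixels": 0.01, "density": 2.0, "velocity": 0.5, "duration": 0.1}

# The target with the highest interest value is selected for tracking.
selected = max(targets, key=lambda t: interest_value(t, w))
```

Keeping the scores in a list, as the system description later does, allows the second-ranked target to take over when an attentional reorientation occurs.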


This paper presents and evaluates a binocular tracking system with its computational resources allocated according to an attentional management capability. It is expected to select and track a target that raises the system's attention among other independently moving objects. System behavior is analyzed in terms of target competition for the system's attention, target selection, attentional reorientation, and the system's ability to maintain the selected target within the field of view by controlling camera position and velocity. The term "target" is here employed to indicate a moving region on the image that raises attention. The analysis encompasses the effect upon the overall system behavior of a variety of issues, such as motion detection and target segmentation with moving cameras in an unstructured environment, the real-time processing requirements for adequate operation, and the adequacy of the attention function, i.e., its structure, composing attentional features and weight selection.

2 SYSTEM DESCRIPTION

Visual processing of images acquired by the vision head is depicted in figure 1. The dominant camera is the left one of the stereo assembly. It is assumed that anything that moves in a scene is a potential target. Motion detection employs image subtraction (Bispo and Waldmann, 1998; Murray and Basu, 1994). A search for moving regions proceeds with the head in static mode (static search). Monocular localization and clustering of moving regions into targets on the dominant image follow. Target matching, either across stereo images or along a sequence, employs the estimated centroid position and its surrounding gray-level pattern, yielding estimates of target velocity on the dominant image and of depth-related stereo disparity. These features enter the attention function along with the number of moving pixels within the target area, its density, and a time measurement which indicates how long each target has been within the field of view. The evaluation of the attention function for each target is stored in a dynamic list which records the relative interest raised by the targets and thus supports the selection of the target to track. Position and velocity estimates of its centroid are used to compute angular increments, which are the command signals for controlling the motion of the dominant camera. Additionally, the estimated disparity is the error signal for controlling the asymmetric vergence of the stereo assembly, which leads to binocular fixation. This process continues as targets evolve within the field of view. Static search for target motion resumes whenever no motion is detected in the scene after eight images.

In the following, the various vision algorithms and their integration into a vision machine are described in more detail. Images are monochromatic with a resolution of 160 x 120 pixels and 256 gray levels unless otherwise stated. This resolution was selected as a compromise between sufficiently accurate information and keeping the computational workload at a bearable level for a PC equipped with relatively inexpensive off-the-shelf image acquisition, A-D/D-A conversion and data communication hardware.

2.1 Processing the dominant images

Motion detection, localization of moving regions, target segmentation and centroid estimation result from the processing of images acquired by the dominant camera during tracking. Motion detection is done by thresholding the subtraction of a pair of consecutive images. Typically, image subtraction indicates the relative differences between both consecutive images. Differences arising from the previous image are discarded by means of a logical "AND" between the thresholded subtraction image and the gradient of the present image. The underlying assumption is that a moving target possesses a much richer texture than its surrounding background. In this case, the motion detector outputs sets of pixels, called here "moving pixels", that belong to moving image regions. Prewitt masks and a convenient threshold are used for the gradient operation (Fairhurst, 1988).
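The detector described above can be sketched as follows: thresholded frame differencing gated by the Prewitt gradient magnitude of the present image. The two thresholds are illustrative assumptions; the paper only says they are "convenient". The Prewitt gradients are computed with NumPy slicing rather than an image-processing library, but the masks are the standard 3x3 Prewitt operators.

```python
import numpy as np

def moving_pixels(prev, curr, diff_thresh=20.0, grad_thresh=40.0):
    """Binary map of 'moving pixels' from two consecutive gray images."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    # Thresholded image subtraction.
    diff = np.abs(curr - prev) > diff_thresh
    # 3x3 Prewitt gradients of the present image (edge-replicated border).
    p = np.pad(curr, 1, mode="edge")
    cd = p[:, 2:] - p[:, :-2]                 # horizontal differences
    gx = cd[:-2] + cd[1:-1] + cd[2:]          # sum over 3 rows
    rd = p[2:, :] - p[:-2, :]                 # vertical differences
    gy = rd[:, :-2] + rd[:, 1:-1] + rd[:, 2:] # sum over 3 columns
    edges = np.hypot(gx, gy) > grad_thresh
    # Differences arising from the previous image (uncovered background)
    # are discarded by requiring strong texture in the present image.
    return diff & edges
```

Note that the gate suppresses the interior of a uniformly bright moving region as well, which is consistent with the paper's assumption that targets are richly textured.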

Figure 1 - System implementation with information flow (dashed lines). [Flowchart: initialization; static search for motion with the dominant camera; if a moving region is detected, monocular localization and target segmentation (dominant camera), monocular estimation of target velocity (dominant camera) and binocular registration feed the attentional management block with target data (centroid, gray-level pattern, velocity, disparity, number of moving pixels, density) and a timer; a selected target triggers binocular fixation by ballistic motion, saccade and smooth pursuit; static search resumes after 8 consecutive images without motion.]

As the camera tracks a selected target, adequate motion detection calls for image stabilization, as seen in figure 2 (Bispo and Waldmann, 1998; Murray and Basu, 1994). It is accomplished by reading measurements from encoders that are fixed to the tilt and vergence axes. Measurement errors, delays between image acquisition and encoder reading, uncertainties in the optical parameters and undesired camera translation due to the distance from the rotating axes to the optical center cause inaccurate stabilization and thus noisy motion detection. This is almost completely remedied by the morphological gray-level opening operator defined in the following sequence of operations (Murray and Basu, 1994):

∀ (i,j) ∈ image array and (k,l) ∈ structuring element R centered on (i,j):

    I_eroded(i,j) = min_{(k,l)} [ I_k(k,l) - I_{k-1,stabilized}(k,l) ]

    I_opened(i,j) = max_{(k,l)} I_eroded(k,l)                        (1)

The result of equation (1) is thresholded and a binary image with the moving pixels results. Many tiny regions falsely detected as moving because of incorrect image stabilization are thus removed. The remaining regions are assumed to be parts of the actual targets. The clustering of moving pixels into classes, and subsequently into superclasses, employs a distance criterion defined on the image plane. Each superclass centroid represents a target location on the dominant image. It is regarded as a candidate for selection and tracking according to its interest value relative to those of the other targets.
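A minimal sketch of the gray-level opening of equation (1) is shown below: an erosion (local minimum of the stabilized difference image) followed by a dilation (local maximum). The 3x3 structuring element and the edge-replicated border are assumptions; the paper does not state the size of R.

```python
import numpy as np

def _windows3(img):
    """All nine 3x3-shifted views of img (edge-replicated border)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]

def gray_opening(diff):
    """Gray-level opening: erosion (min) followed by dilation (max)."""
    eroded = np.min(_windows3(diff), axis=0)
    return np.max(_windows3(eroded), axis=0)
```

Isolated spikes caused by imperfect stabilization are wiped out by the erosion and cannot be recovered by the dilation, while regions larger than the structuring element survive essentially intact, which is exactly the noise-removal behavior described in the text.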

2.2 Target segmentation

Often, independently moving targets exist in the scene. In general, moving regions on the image plane might result either from the same target or from different ones. The clustering of such regions into classes employs a distance-based criterion. The clustering algorithm operates on the binary image that results from thresholding the output of equation (1). There is no sort of motion analysis. This reflects the hypothesis that nearby moving regions belong to the same target, and it fails if neighboring regions move very differently because they actually belong to distinct nearby targets.

Figure 2 - Image subtraction with compensation of camera motion. a) Previous image. b) Image after motion compensation. c) Present image. d) Subtraction result.

Clustering is a computationally heavy task. It is divided into two phases. Initially, classes are derived from sets of nearby pixels which belong to moving regions. The same target may spread over a number of classes if its projection on the image plane covers a large area. Afterwards, the derived classes are grouped into superclasses, which are analogous to classes but at a macroscopic level. Therefore, as nearby pixels are clustered into classes, so are nearby classes into superclasses. The latter is the adopted representation for the multiple targets on the image.

Since the purpose of visual tracking in primates seems to be the stabilization of the projection of the selected target on the high-resolution fovea, it seems reasonable to expect that for machine vision systems a corresponding area on the image should have a higher resolution and receive a larger share of attention. A region of fixation is then defined on each stereo image. Targets that raise attention should be maintained within such a region with the highest resolution available for processing. The resulting foveated images are employed to alleviate the computational burden, as in Bandera et allii (1996) and Scott and Bandera (1990). The region of fixation employs the original image resolution, and subsampling generates a periphery with a coarser resolution.

2.2.1 Foveated images by subsampling

Some topologies for multiple-resolution images are shown in Andersen (1996). In general, they consist of a central fovea at maximum resolution surrounded by rings of decreasing resolution. To alleviate the computational workload of clustering the moving regions, the partition of the image into a fovea and a single-ring periphery is considered satisfactory. The central 80x60 fovea is selected to coincide with the region of fixation. Subsampling is done by averaging within 4x4 windows and then thresholding. Only the binary output of equation (1) which corresponds to moving regions on the image periphery is subsampled, thus further reducing the computational workload. An instance of the fovea-aided motion detector operating on images acquired with full resolution is shown in figure 3.
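The fovea/periphery split can be sketched as below for a 160x120 binary motion map: the central 80x60 region of fixation is kept at full resolution, while 4x4 block averaging followed by thresholding produces the coarse periphery. The 0.5 vote threshold is an assumption, and for simplicity the coarse map is computed over the whole image rather than the periphery only.

```python
import numpy as np

def foveate(binary):
    """Split a 120x160 binary motion map into a full-resolution fovea
    and a 4x4-subsampled coarse map."""
    h, w = binary.shape               # 120 x 160 in the paper
    fh, fw = 60, 80                   # central region of fixation
    top, left = (h - fh) // 2, (w - fw) // 2
    fovea = binary[top:top + fh, left:left + fw]
    # 4x4 block average, then threshold on the fraction of moving votes.
    coarse = binary.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3)) > 0.5
    return fovea, coarse
```

Downstream, the clustering algorithm treats each coarse cell as a 4x4 block of high-resolution pixels, which is why the class-seeding rules later assign a count of 16 to peripheral seeds.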

Figure 3 - Fovea-aided motion detector. a) Previous image. b) Present image. c) Binary result.

2.2.2 Clustering moving pixels into classes

The clustering algorithm employed here is a modified version of that described by Fairhurst (1988). Basically, it scans the output of the motion detector for pixels labeled as moving and attempts to assign them to classes according to a distance metric. Class attributes are the number of moving pixels in a class, #pix(class), the size of the smallest rectangle enclosing such pixels, its corresponding area, and the density of moving pixels, dens(class), in this rectangle. Three cases may occur:

1 - None among the n already existing classes c1,...,cn is found sufficiently close to the moving pixel c = (x, y) under consideration. In this case, it becomes the seed of a new class c_{n+1} with centroid c_{n+1} = (x_{n+1}, y_{n+1}). Its coordinates are described in the high-resolution coordinate frame regardless of whether it is on the fovea or on the periphery. The quantity of pixels in this new class is 1 if it is on the fovea or 16 on the periphery, the latter corresponding to a 4x4 window of high-resolution pixels. Likewise, its area is defined as 1 if on the fovea or 16 on the periphery. The density is one, corresponding to the ratio between the quantity of pixels in the class and its area. The smallest enclosing rectangle is a square with dimensions 1x1 on the fovea or 4x4 on the periphery.

2 - n classes have already been initiated, each one with its respective moving pixels given by (x_p, y_p), p = 1,2,...,#pix(c_k), k = 1,...,n, and at least one of them is sufficiently close to the moving pixel c = (x, y) under consideration. Proximity to the k-th class centroid c_k is defined according to the Euclidean distance on the maximum-resolution image, independent of whether the fovea or the periphery is involved:

    d = min_k || c - c_k ||_2 ≤ L_d ,    k = 1, 2, ..., n        (2)

If the above criterion is true, then the k-th class that minimizes the above equation and does not violate the inequality accepts c = (x, y), and the class attributes are updated: its number of moving pixels #pix(c_k) is incremented, as are the size of the smallest enclosing rectangle, its area and its density, each as required and according to whether the location of c = (x, y) is on the fovea or on the periphery. The class centroid is not updated in this case.

3 - Classes have been initiated but none satisfies d ≤ L_d. In this case, all class centroids are recomputed to reflect the most recent state of clustering and a new attempt at minimizing equation (2) is made. If successful, the procedure is analogous to case 2. If not, the procedure continues according to case 1. The purpose of case 3 is to reduce the amount of required computations.

The threshold L_d is chosen to bias the clustering results towards generating many small classes from one large target, as opposed to a number of targets ending up within one large class. The clustering algorithm is applied to the output of the fovea-aided motion detector - the moving pixels - and yields classes - moving regions formed by concatenating such pixels. Figure 4 shows an example.

Figure 4 - Clustering moving pixels into classes. a) Fovea-aided detected moving regions. b-g) Resulting classes. h) Superclass resulting from class concatenation.
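The three cases above can be sketched as follows. The sketch is simplified: it ignores the fovea/periphery pixel counts and rectangle attributes, keeping only centroids and member lists, and the value of L_d is an illustrative assumption. As in case 2, centroids are left untouched on ordinary assignments and are only recomputed when no class is close enough (case 3), which mirrors the paper's computation-saving strategy.

```python
import math

def _nearest(classes, p, L_d):
    """Index of the closest class whose centroid is within L_d, else None."""
    best, best_d = None, L_d
    for i, c in enumerate(classes):
        d = math.dist(p, c["centroid"])
        if d <= best_d:
            best, best_d = i, d
    return best

def cluster(pixels, L_d=8.0):
    classes = []  # each: {"centroid": (x, y), "members": [(x, y), ...]}
    for p in pixels:
        k = _nearest(classes, p, L_d)
        if k is None:
            # Case 3: recompute all centroids, then retry equation (2).
            for c in classes:
                xs, ys = zip(*c["members"])
                c["centroid"] = (sum(xs) / len(xs), sum(ys) / len(ys))
            k = _nearest(classes, p, L_d)
        if k is None:
            classes.append({"centroid": p, "members": [p]})  # case 1: new seed
        else:
            classes[k]["members"].append(p)                  # case 2: assign
    return classes
```

With a small L_d, a large target fragments into several small classes, which is the bias the text asks for; the superclass-merging step of the next section then reassembles them.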

2.2.3 Clustering of classes into superclasses and target formation

The previous section showed how moving pixels are clustered into classes according to distance. However, such classes cannot yet be regarded as distinct targets, because one target can be spatially spread over a number of classes. It is assumed that nearby classes are likely to belong to the same target. Once more, a distance metric is the basis of a criterion for accepting or not the merger of classes into a superclass. Additionally, the resulting density is also considered before accepting a merger. Each i-th superclass Sc_i, i = 1,...,n, is a subset of the n existing classes. Each class is to be contained in only one superclass. Each superclass is initialized as an empty set, and each non-empty one that results by the end of all mergers becomes a target. Let d(c_i, c_j) represent the Euclidean distance on the image plane between the centroids of classes c_i and c_j, and dens(c_i) the i-th class density. The logical-valued criterion C(c_i, c_j) for accepting a merger of c_i with c_j into the k-th superclass is as follows (see Appendix A):

C(ci,cj) = [ d(ci,cj) ≤ 2.5·Ld ] AND
[ dens(ci ∪ cj) = (dens(ci)·#pix(ci) + dens(cj)·#pix(cj)) / #pix(ci ∪ cj) ≥ (2/3)·dens(ci) ]   (3)

The density attribute enters the acceptance criterion to reduce the detrimental effect of image noise. In general, stabilization errors cause image artifacts that, when not eliminated by the gray-level opening operation, contain a few sparsely distributed pixels incorrectly detected as moving. A merger with such a region erroneously accepted as a class would not significantly alter the resulting number of pixels. The same reasoning may however be incorrect for the density attribute, since there could be a significant increase in target area. If such a merger were accepted into a superclass, the resulting target would be large and composed of sparse pixels within some of its parts. Targets with low density should not raise the attention of the vision system.

No sort of motion analysis is used to decide whether a merger should be accepted or not. Figure 5 shows that adequate segmentation is achieved for a pair of targets. As they approach one another, though, the previous result blends into one target. Target attributes and data available for further processing are the number of targets, the density and number of moving pixels in each target, the estimated centroid and the corresponding gray-level patterns around the centroid of each target.
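A minimal sketch of the merger test in equation (3), assuming each class is represented by a small dictionary (field names are illustrative, not taken from the original code):

```python
import math

def accept_merger(c_i, c_j, L_d):
    """Logical-valued merger criterion of equation (3).

    Each class is a dict with 'centroid' (x, y), 'npix' (#pix) and
    'dens' (density).  Returns True when ci and cj may be merged.
    """
    d = math.hypot(c_i['centroid'][0] - c_j['centroid'][0],
                   c_i['centroid'][1] - c_j['centroid'][1])
    if d > 2.5 * L_d:
        return False
    npix_union = c_i['npix'] + c_j['npix']
    # density of the union as the pixel-weighted mean of the two densities
    dens_union = (c_i['dens'] * c_i['npix'] + c_j['dens'] * c_j['npix']) / npix_union
    return dens_union >= (2.0 / 3.0) * c_i['dens']
```

The density test is what rejects mergers with large, sparsely populated noise regions even when they contribute few pixels.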

2.3 Monocular velocity estimation

Target velocity on the image plane is a feature that enters the computation of the target interest value. Rapidly moving objects are more likely to leave the visual field before the system has an opportunity to track them. In terms of a machine vision system, fast objects may bring about the danger of collision; in terms of biological vision systems, they may represent an agile prey or predator. Hence, one should expect fast objects to raise more interest than slow ones.

Image velocity estimation is based on target displacement between consecutive images. As a number of targets may exist within the field of view at a certain time, the establishment of target correspondence in consecutive images is required. The correspondence criterion employs a variation of the sum-of-squared-differences (SSD) method: the sum-of-squared-pixels (SSP) is used for normalization. Let It and It-1 represent consecutive images acquired by the dominant camera, R2 an MxN window in It containing a gray-level pattern around a target centroid and R1 likewise in It-1.

Figure 5 - Clustering classes into superclasses and generation of targets. a-b) Consecutive images. c) Resulting targets.

Consider further a search area of size PxQ around a target centroid in It-1, with P>M and Q>N. The need for a search area originates in the displacement of the estimated centroid within the target borders from image to image due to changes in illumination, object pose and fluctuations in the motion detector output. Then, for each combination of R2 and R1:

SSD'(R1,R2) = min(a,b) { [ Σ(i=0..M-1) Σ(j=0..N-1) (R1(i,j) - R2(i+a,j+b))² ] / [ Σ(i=0..M-1) Σ(j=0..N-1) (R2(i,j))² ] },

a ∈ {0,1,...,P-M}, b ∈ {0,1,...,Q-N}   (4)

The criterion for establishing a corresponding pair of gray-level patterns is:

corresp(R1*, R2*) = min(R1 ∈ S1, R2 ∈ S2) [ SSD'(R1,R2) ] ≤ Lac   (5)

where S1 (S2) is the set of remaining R1 (R2) windows left for correspondence. Whenever a correspondence is established the corresponding R1* (R2*) is removed from S1 (S2). The inequality refers to a threshold set to 0.15 for accepting a correspondence as valid. Targets in the most recent image that could not find an acceptable correspondence are labeled as new and their velocities are set to null. A list of corresponding targets is built and the associated velocity estimates are stored for later use by the attentional management module. Empirically tuned window dimensions are M=N=5 and P=Q=21. This is a compromise between the preservation of the information content and the desired robustness to unexpected image variations and to changing background texture around the target. The former precludes too small windows whereas the latter precludes too large ones.
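Under these conventions, the SSD'-based matching of equations (4) and (5) might be sketched as below. The NumPy formulation and the guard against an all-zero candidate window are assumptions; the paper's routines were coded in C and Assembly.

```python
import numpy as np

def ssd_prime(R1, search):
    """Normalized SSD of equation (4).

    R1: MxN gray-level pattern from the previous image; search: PxQ
    search area from the current image.  Each candidate window R2 is
    scored by the SSD with R1 divided by its own sum of squared pixels
    (the SSP normalization); the minimizing offset (a, b) is returned.
    """
    M, N = R1.shape
    P, Q = search.shape
    best, best_ab = float('inf'), (0, 0)
    for a in range(P - M + 1):
        for b in range(Q - N + 1):
            R2 = search[a:a + M, b:b + N].astype(float)
            ssp = (R2 ** 2).sum()
            if ssp == 0:
                continue  # all-dark candidate window: avoid division by zero
            val = ((R1.astype(float) - R2) ** 2).sum() / ssp
            if val < best:
                best, best_ab = val, (a, b)
    return best, best_ab
```

A correspondence would then be accepted when the returned minimum falls below the 0.15 threshold of equation (5).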

Figure 6 displays an instance of target correspondence between images. Target 2 is properly corresponded along the image sequence until its occlusion. Segmentation produces only one target as both approach one another. Minute motion causes target 1 to be missing in figure 6b. Minute motion, though detected by image subtraction, is further deleted by the gray-level opening operation described in Section 2.1. Target 1 reappears as new target 3 in figure 6c because the correspondence process does not store a gray-level pattern for more than one image.

Figure 6 - Monocular SSD-based correspondence along an image sequence.

Targets with similar textures may yield false correspondences, and the SSD criterion is known to be susceptible to rotations, geometric deformations and illumination changes (Hager and Belhumeur, 1996). Moreover, ambiguities arise due to the aperture problem. The function to minimize in equation (4) generates a surface whose principal curvatures are useful to determine the directions of the largest and the smallest uncertainties when establishing a correspondence between targets. A small curvature indicates large uncertainty and vice versa. Once a correspondence is produced, the target displacement has more uncertainty in its component along the direction of smallest curvature. This effect is often due to a rather homogeneous target texture along this direction. The uncertainty analysis regarding target correspondence will be further elaborated in Section 4, which discusses the results.

2.4 Binocular Registration

Image disparity originates in the image acquisition from distinct points of view of a stereoscopic assembly, as shown in figure 7. It depicts the baseline b, the 3D point P and its perspective projection Pl (Pr) on the left (right) image plane according to the projection center Ol (Or), and the change in the epipolar constraint caused by verging cameras. Disparity is related to object depth. Depth estimation requires knowledge of parameters such as focal length, baseline and vergence angle. For the purpose of managing the system attention, disparity estimation suffices, and normalized correlation-based registration operating on a pyramidal data structure is used. The vergence angle of the non-dominant camera is controlled aiming at nulling the estimated disparity. Ideally, the cameras are oriented in such a way that the projection of the tracked object falls upon their respective regions of fixation. This control employs correlation, which requires a rich texture around the target centroid. Normalization reduces the sensitivity of the correlation method to illumination changes, but the method suffers performance limitations due to rotation and scale factor changes. The use of a pyramidal data structure allows a coarse-to-fine solution to the registration problem


which further helps to reduce the computational workload. A pyramidal data structure is built for each of the stereo images. It consists of three levels of resolution: the original 160x120 (level 0), 40x30 (level 1) and 20x15 (level 2). Levels 1 and 2 are generated by partitioning level 0 into 4x4 and 8x8 windows, respectively, and computing the average gray level in each window.
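The pyramid construction reduces to block averaging and can be sketched as follows; the NumPy reshaping trick is an implementation choice, not the paper's.

```python
import numpy as np

def build_pyramid(img):
    """Three-level pyramid: level 0 is the 160x120 original; levels 1
    and 2 average non-overlapping 4x4 and 8x8 windows of level 0,
    yielding 40x30 and 20x15 images.
    """
    img = np.asarray(img, dtype=float)   # expected shape (120, 160)

    def block_mean(a, k):
        # average each non-overlapping k x k block
        return a.reshape(a.shape[0] // k, k, a.shape[1] // k, k).mean(axis=(1, 3))

    return [img, block_mean(img, 4), block_mean(img, 8)]
```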

Figure 7 - Effect of camera vergence on the location of the epipolar constraint.

The registration of a target previously detected on the dominant image is made by correlating the gray-level pattern around its computed centroid with patterns within a search area. This area is positioned around the epipolar constraint on the non-dominant image, as seen in figure 8. Due to the vergence angle control for binocular fixation, the epipolar constraint moves and in general does not coincide with a row of pixels; such a coincidence only occurs when the cameras are aligned with parallel optical axes. To cope with the varying position of the epipolar constraint on the non-dominant image, a 3-pixel-wide search area in level 2 is defined. The corresponding search area on the non-dominant image in level 0 is 13 pixels wide about the row of pixels which, on the dominant image, contains the target centroid. Thus, the required computational effort is reduced and the rate of correctly estimated disparities is improved.

Figure 8 - Search area encompassing the epipolar constraint.

Normalized correlation is as follows:

c(f,g) = cov(f(i,j), g(i+d,j+e)) / ( σ(f(i,j)) · σ(g(i+d,j+e)) )   (6)

where f(.,.) and g(.,.) are the dominant and non-dominant images, respectively, cov(.,.) indicates covariance and σ(.) the standard deviation. The position of a target centroid on the dominant image array is given by indices i,j and the candidate disparity components on the non-dominant image by d,e. The following statistics estimators circumvent redundant computations and yield the correct result when used in equation (6) (Sun, 1997):

cov(f(i,j), g(i+d,j+e)) = Σ(m=i-K..i+K) Σ(n=j-L..j+L) f(m,n)·g(m+d,n+e) - (2K+1)(2L+1)·f̄·ḡ   (7)

σ²(f(i,j)) = Σ(m=i-K..i+K) Σ(n=j-L..j+L) f²(m,n) - (2K+1)(2L+1)·f̄²   (8)

σ²(g(i+d,j+e)) = Σ(m=i-K..i+K) Σ(n=j-L..j+L) g²(m+d,n+e) - (2K+1)(2L+1)·ḡ²   (9)

with f̄ and ḡ as the average gray level of each pattern. Pattern sizes at each resolution level of the pyramid are 7x7 (K=L=3) at level 0, 5x5 (K=L=2) at level 1 and 3x3 (K=L=1) at level 2. Candidate disparities in level 2 are ordered from the largest to the smallest correlation value computed from equation (6). A coarse-to-fine procedure follows as the best candidate disparity in level 2 is refined at level 1 and subsequently at level 0, using as initial guess the best disparity found at the previous, coarser level. The best candidate at level 0 is accepted as the best registration (d*,e*) according to the following criterion, which removes the average gray level in order to reduce the sensitivity to illumination changes:

e(d*,e*) = Σ(m=i-2..i+2) Σ(n=j-2..j+2) | ẽf(m,n) - ẽg(m+d*,n+e*) | ≤ Ldisp   (10)

where

ẽf(m,n) = f(m,n) - (1/25) Σ(a=i-2..i+2) Σ(b=j-2..j+2) f(a,b)

ẽg(m+d*,n+e*) = g(m+d*,n+e*) - (1/25) Σ(a=i+d*-2..i+d*+2) Σ(b=j+e*-2..j+e*+2) g(a,b)

If the inequality in equation (10) is violated, then the second best candidate disparity of the ordered list at level 2 undergoes the coarse-to-fine procedure and is tested against the above criterion. The process continues until a candidate disparity is accepted. If none passes the inequality then the disparity is labeled void. This label is also employed when a small standard deviation in equation (6) indicates a rather homogeneous texture, which does not suffice for the purpose of binocular registration.
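The window statistics of equations (7)-(9) feeding the normalized correlation of equation (6) can be sketched as below. Boundary handling and the coarse-to-fine loop are omitted, and the void-disparity test for homogeneous texture is reduced to a non-positive variance check.

```python
import numpy as np

def norm_corr(f, g, i, j, d, e, K, L):
    """Normalized correlation of equation (6) via equations (7)-(9).

    f, g: dominant and non-dominant images; (i, j): centroid on f;
    (d, e): candidate disparity; window size (2K+1) x (2L+1).
    Returns None (void) for homogeneous texture.
    """
    F = f[i - K:i + K + 1, j - L:j + L + 1].astype(float)
    G = g[i + d - K:i + d + K + 1, j + e - L:j + e + L + 1].astype(float)
    n = (2 * K + 1) * (2 * L + 1)
    fbar, gbar = F.mean(), G.mean()
    cov = (F * G).sum() - n * fbar * gbar       # equation (7)
    var_f = (F ** 2).sum() - n * fbar ** 2      # equation (8)
    var_g = (G ** 2).sum() - n * gbar ** 2      # equation (9)
    if var_f <= 0 or var_g <= 0:
        return None   # homogeneous texture: disparity labeled void
    return cov / np.sqrt(var_f * var_g)
```

Accumulating raw sums in this way is what lets the implementation reuse partial sums across neighboring candidate disparities.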

2.5 Attentional Management

The presence of multiple targets within the field of view raises the need for an adequate allocation of attentional resources. Some criteria for such allocation are found in the literature. Balkenius and Kopp (1996) employed edge motion intensity to evaluate an attentional field over the image. Culhane and Tsotsos (1992) proposed an attentional model which Andersen (1996) combined with a scale-space representation similar to the image pyramid employed here. A combination of 42 normalized feature maps - contrast, color and edge orientation among them - yielded a saliency map whose peaks were selected via a winner-take-all neural net in the work by Itti et alii (1998). Ahuja and Abbott (1993) presented some


experimental results on the psychophysics of human vision which support proximity to the fovea, close fixation points and close range to objects as attentional attractors.

The attentional features available for the i-th target, i=1,...,n, are its velocity magnitude |veloc(i)| on the image plane, number of moving pixels #pix(i), density dens(i) and binocular disparity disp(i) if the latter is not void. They are linearly combined in the attention function I(i) along with a time measurement of how long a target occurs in consecutive images:

I(i) = α1·#pix(i) + α2·|veloc(i)| + α3·dens(i) + α4·(1 - timer(i)) + α5·disp(i)   (11)

where the weights were tuned during the experiments to the following values:

α1 = 8/(160·120), α3 = 0.3, α4 = 0.05;

α2 = 0 if max_veloc = 0 [pixels/s], α2 = 1/(3·max_veloc) otherwise, with max_veloc = max(t=1,...,n) |veloc(t)|;

α5 = 0 if disp(i) = null, α5 = -1/160 otherwise.

Tuning was carried out according to the following considerations:

a) The maximum possible value for #pix(.) is the total number of pixels on the image. The factor 8 in the numerator of α1 provided an adequate balance between the influence of the #pix term and that of the other terms in the attention function. The normalization by max_veloc and the factors 1/3 in α2 and 0.3 in α3 follow a comparable justification;

b) The timer term has the purpose of gradually decrementing the interest in old targets. New targets should raise the system attention because they might bring new important information about the evolving scene contents within the field of view;

c) The normalization of α5 uses the maximum value which can be attained by the horizontal component of a disparity. A target depth smaller (larger) than that of the fixation point in the 3D space yields a negative (positive) disparity. The fixation point is here the 3D point with null disparity because it is, ideally, located at the intersection of the optical axes. The negative sign of α5 results in a higher attentional value for a near object than for an object beyond the fixation point. Therefore, the control of the vergence angle for binocular fixation, in spite of altering the disparity magnitudes, still preserves their relative effect on the attention function.
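Putting the weights together, the attention function of equation (11) might read as below. The dictionary field names are illustrative, and the sign convention for α5 follows item c) above.

```python
def interest(target, max_veloc):
    """Attention function of equation (11) with the reported weights.

    target: dict with '#pix', 'veloc', 'dens', 'timer' and 'disp'
    ('disp' may be None when the disparity is void).
    max_veloc: largest velocity magnitude among the current targets.
    """
    a1 = 8.0 / (160 * 120)                 # #pix normalized by image size
    a2 = 0.0 if max_veloc == 0 else 1.0 / (3.0 * max_veloc)
    a3, a4 = 0.3, 0.05
    a5 = 0.0 if target['disp'] is None else -1.0 / 160.0
    disp = target['disp'] if target['disp'] is not None else 0.0
    return (a1 * target['#pix'] + a2 * abs(target['veloc'])
            + a3 * target['dens'] + a4 * (1 - target['timer'])
            + a5 * disp)
```

With the negative a5, a near target (negative disparity) scores higher than an otherwise identical target beyond the fixation point.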

A dynamic list is built with entries being created or deleted as targets on the images appear and disappear. Only the attentional features of the most recent pair of consecutive images are stored in the list to limit its complexity. If a target on the most recent image is not in registration with any target on the previous image then it is labeled a new target. If no motion at all is detected then the attentional features are kept in the list for a maximum of eight consecutive images, during which the vision head continues with its smooth pursuit in an attempt to find the target again. Such caution derives from the possibility of target minute motion being filtered out by the gray-level opening operation or of target occlusion. If no motion is detected after that then the system halts its motion and static search resumes.

The target selected for tracking is the one that maximizes equation (11). However, incorrectly detected tiny targets that arise due to errors in background motion compensation and unsuccessful filtering by the opening operation often possess a high density. Such is the case of a 1-pixel false target. Moreover, when such a case occurs, the respective timer term in equation (11) attains its maximum value. To cope with such an artifact, it is required that the selected target occur three times, not necessarily in consecutive images, before the vision head performs a ballistic motion to reorient and resume tracking. If no target is in agreement with this constraint, it is reduced to two occurrences and, if necessary, to only one occurrence. This means that the restriction is temporarily either softened or disabled. In either situation, the target maximizing equation (11) is selected but not tracked. Selection in spite of a no-track status is required to update the state of a centroid position filter which smooths the head motions during monocular tracking and binocular fixation.

2.6 Binocular Fixation

Binocular fixation aims at correctly orienting the cameras in tilt and vergence in such a way that the selected target stays within a 10x10 region of fixation about the center of the stereo images. Control of either the camera angular position or its velocity depends on the type of motion which is engaged.

Ballistic motion occurs when the camera attitude should rapidly change as the attentional management selects a new target for tracking. It consists of a very fast angular displacement with a strong acceleration at the motion onset and offset. It is activated only after a newly selected target occurs three times. As mentioned in the previous Section, this helps to reduce the impact of image noise and of background compensation errors. Overshoot and oscillations at motion onset and offset result in image blur and erroneous encoder readings, which prevent image processing during ballistic motion. Image processing is resumed at motion completion. Figure 9 shows the tilt and vergence angles θ and γ, respectively, the error components ex and ey, the focal distance fu in pixels and the projection center O, all of which enter the computation of the angular increment to be attained by a ballistic motion.

Required angular increments in tilt and vergence in encoder units are given by:

γ = tg⁻¹(ex/fu);  θ = tg⁻¹(ey/fu);

inc_pos_verg = 180·γ/(166.67·π);  inc_pos_tilt = 180·θ/(44.4·π)   (12)

Saccades are corrections to the camera angular velocity so that the selected target is ideally kept within the region of fixation. A strong acceleration occurs at the motion onset and offset, but image processing is not interrupted. Required increments to angular velocities in tilt and vergence are computed as:

inc_vel_verg = inc_pos_verg/(q·Δt);  inc_vel_tilt = inc_pos_tilt/(q·Δt)   (13)

where q is the specified number of frames ahead and Δt the average time interval between frames. The computation of the increments to angular velocity requires the specification of the


number of frames ahead in time to fulfill the required angularincrements.
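Equations (12) and (13) can be combined in a short routine. The exact placement of the encoder scale factors 166.67 and 44.4 is a reconstruction from the text, and the default Δt of 1.25 s merely reflects the reported rate of about 0.8 stereo pair per second.

```python
import math

def ballistic_increments(ex, ey, fu=221.8, q=1, dt=1.25):
    """Angular increments of equations (12)-(13).

    ex, ey: tracking error on the image plane in pixels;
    fu: focal distance in pixels (221.8 from off-line calibration);
    q: number of frames ahead; dt: average interframe interval (s).
    """
    gamma = math.atan(ex / fu)   # vergence error angle
    theta = math.atan(ey / fu)   # tilt error angle
    inc_pos_verg = 180.0 * gamma / (166.67 * math.pi)
    inc_pos_tilt = 180.0 * theta / (44.4 * math.pi)
    inc_vel_verg = inc_pos_verg / (q * dt)   # equation (13)
    inc_vel_tilt = inc_pos_tilt / (q * dt)
    return inc_pos_verg, inc_pos_tilt, inc_vel_verg, inc_vel_tilt
```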

Figure 9 - Tracking error on the image plane and required angular increments.

Smooth pursuit maintains the camera's angular velocity as long as the selected target stays within the region of fixation. Ideally, this target stabilization should produce a sharper image of the tracked target. The region of fixation cannot be too large because large image displacements would then result during tracking: a large target motion would be allowed before the engagement of a correcting saccade. The non-dominant camera does not perform smooth pursuit; a saccade correcting its angular velocity is performed after each image acquisition. Whenever the disparity of the tracked target is void, binocular fixation assumes the last valid disparity in spite of the changing vergence angles.

Fluctuations in the output of the target segmentation process due to illumination changes and background texture cause the computed centroid of a target to suffer displacements which are not related to its motion. However, attentional management interprets this as a target slipping away from the region of fixation. To reduce this undesired effect, the centroid coordinates of each target on the dominant image are filtered by:

cg_f(k) = 0.6·cg(k) + 0.3·cg(k-1) + 0.1·cg(k-2)   (14)

where k indexes an image in a sequence. If the last three occurrences of the selected target are not consecutive, or fewer than three occurrences are observed, then the system resorts to the unfiltered computed centroid for tracking purposes.
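The filter in equation (14) is a fixed-weight FIR filter over the last three centroid estimates:

```python
def filter_centroid(cg_k, cg_k1, cg_k2):
    """Centroid smoothing of equation (14).

    cg_k, cg_k1, cg_k2: the (x, y) centroids at images k, k-1 and k-2.
    """
    return tuple(0.6 * a + 0.3 * b + 0.1 * c
                 for a, b, c in zip(cg_k, cg_k1, cg_k2))
```

The decreasing weights favor the most recent observation while damping frame-to-frame centroid jitter.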

3 SYSTEM IMPLEMENTATION

The binocular tracking system was evaluated with the Helpmate Robotics Bisight vision head, nicknamed Otelo, at the ITA/INPE Active Computer Vision and Perception Lab - LVCAP-ITA/INPE - (http:\\www.ele.ita.cta.br/~labvisao/) described in Viana et alii (1999). Image processing and the computation and transmission of command signals to Otelo were carried out by a 166 MHz Pentium MMX PC equipped with 64 MB RAM, a 2 GB SCSI hard disk, a DataTranslation DT3152 monochrome video acquisition board and two AD-DA conversion boards. The latter exchange data with the drives that control the Fujinon KPM1 servoactuated lenses. Off-line calibration produced the estimate fu=221.8 pixels (Viana et alii, 1999). Targets were made with paper and black tape to provide texture. The targets underwent translation by means of a setup of wires and pulleys. The physical dimensions of the lab precluded experiments with significant motion along the depth direction and thus a more effective evaluation of the disparity term in the attention function. Eventually, moving people were included in the experiments.

The system was developed on the Windows for Workgroups 3.11 platform and coded in three distinct programming languages: the graphic user interface in Visual Basic 3.0, the tracking routines in the Visual C++ 1.51 16-bit environment and the most computationally intensive image processing routines in 80486 Assembly, compiled with Turbo Assembler 5.0. For comparison purposes, image subtraction for motion detection took about 11 ms when coded in C and 1 ms in Assembly. The application of the gradient operator on a 160x120 image consumed about 38 ms when coded in C and 11 ms in Assembly. Routines coded in Assembly are: image subtraction, gray-level opening, thresholding, logical ANDing of binary images, the generation of foveated images and normalized correlation. Image capture consumed 130 ms as the single acquisition board first selects a camera and then acquires the image for each camera of the stereo assembly.

Table 1 depicts typical time intervals elapsed when the main processing tasks are performed. Binocular fixation shows significant variations that depend on the performed motion. Ballistic motion requires a full stop by the vision head before image processing is resumed. It is followed by a saccade which issues an angular velocity update command to the head drive. The host computer awaits a confirmation that the command was successful before resuming the image processing. Finally, smooth pursuit by the dominant camera includes a saccade by the non-dominant camera.

Static search (ms): 553
Monocular motion localization and target segmentation (ms): 721
Monocular estimation of target velocity (ms): 195
Binocular registration (ms): 215
Attentional management (ms): < 1
Binocular fixation (ms): ballistic motion: 709; saccade: 177; smooth pursuit: 65

Table 1 - Typical computational workload of the main processing tasks.

Static search for motion and monocular target segmentation, in spite of being mostly implemented in Assembly, are in a listed entry which includes the acquisition of stereo images and encoder reading. The latter required issuing a command to the head drive and awaiting until the measurement became available. The frame rate attained during operation was about 0.8 stereo frame pair/second due to limitations of both the off-the-shelf hardware and the software platform. This rate severely limited interframe target motion during the experiments but did not invalidate the assessment of the overall system potential.

4 RESULTS AND ANALYSIS

Performance analysis is presented both quantitatively and qualitatively. A qualitatively successful performance is characterized by keeping the selected target within the field of view of both cameras. A proposal for a quantitative performance assessment requires the definition of a metric which should be independent of a particular system implementation. The proposed metric is the lock-on rate P within a 30x30 test window about the center of the stereo images, which is given by:


P = Nc / Nt   (15)

where Nc represents the number of frames in which the tracked target centroid is within the test window and Nt the number of frames in which this target is selected for tracking. The latter does not include the three occurrences before which a ballistic motion is not engaged for attentional reorientation.
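Equation (15) is a simple ratio; as a check, the first column of Table 3 (Nc = 47 of Nt = 60 frames on the dominant image) gives P of about 78%:

```python
def lock_on_rate(Nc, Nt):
    """Lock-on metric of equation (15), expressed as a percentage."""
    return 100.0 * Nc / Nt
```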

Four experiments were devised. They consisted of stereo sequences lasting 120 s. Experiment A consisted of a rigid target under mainly horizontal translation. Experiment B consisted of two translating rigid targets, aiming at an evaluation of the attentional management process. Experiment C consisted of a moving person, which led to various targets undergoing non-rigid motion. Finally, experiment D added a rigid target to the latter experiment. Excerpts of the stereo images acquired in the experiments are found in Appendix B. Each detected target received an identification tag for as long as it was tracked. The numerical tag on the selected target is white over a black background whereas the other targets have their tags with inverted gray levels. Image pairs corresponding to the dominant and non-dominant cameras are displayed right and left, respectively. The 10x10 region of fixation and the 30x30 test window are delimited in the display by a black trace and a white one, respectively, about the dominant camera image center. Since the non-dominant camera does not undergo smooth pursuit, no region of fixation is displayed on its image.

Frame rate is affected mainly by the elapsed time required to complete a ballistic motion, saccade or smooth pursuit and by the number of moving pixels which form the targets. Generally, good results were obtained in sequence A. Sequence B presented less accurate tracking, mainly due to occasional failures of the SSD-based target correspondence along the sequence. The occurrence of numerous targets - caused by rotation and non-rigid motion - in sequences C and D caused disputes for the system attention which resulted in a rather erratic behavior. Such performance degradation was caused by significant interframe displacements as moving people went in front of the cameras with their gait. Dealing with this condition pushed the system to the limits of its implementation. The subjects were maintained within the field of view during most of the sequences though. The workload of sequentially processing the images and the sluggish vision head control cycle are the main reasons for the present bottleneck. The control cycle initiates with the issuing of commands from the host computer to the vision head drives and ends with the emission by the drive to the host computer of a message that the requested command was accomplished. Table 2 presents the number of stereo pairs in each experiment and the equivalent stereo pair rate. Uhlin (1996) and Andersen (1996) reported image processing and head control with special-purpose hardware at rates of 25 and 10 frames per second, respectively.

               Stereo pairs   Rate (stereo pairs per second)
Experiment A        97               0.8
Experiment B       106               0.9
Experiment C        78               0.6
Experiment D        78               0.6

Table 2 - Average operating rate during each experiment (120s-long stereo sequences).

In experiment A, the SSD-based method for corresponding targets along the sequence of dominant images presented incorrect results. The only target on the image was on two occasions labeled as new, and three targets were produced. Correlation-based binocular registration was more susceptible: 13 stereo pairs yielded void disparities and another three stereo pairs produced incorrect disparities due to similarities between the target texture and that of the background. As Ldisp in equation (10) allowed for an average mismatch magnitude of 25 gray levels per pixel, it is claimed that correlation-based registration seems to be quite sensitive to the experiment conditions in spite of its normalization. This could have been caused by the combination of the changing vergence angles and a low frame rate, which translate into illumination differences, changes in texture projection onto each camera and a varying scale factor.

Experiment B resulted in 23 detected targets. One reason for that is that the system has no memory of the targets: whenever one leaves and later reenters the visual field, it is labeled as new. Another factor was the erratic operation of the SSD-based method when a target had its computed centroid near its boundary with the background. In such a case, some of the background texture entered the gray-level pattern window R1 or R2. The target motion affected the appearance of the surrounding background texture as occluded background became uncovered. The experiment showed that the SSD-based method is effective when a target with a rich texture moves over a background with poor texture.

Experiment C resulted in 49 detected targets. The SSD criterion showed good results in spite of some targets undergoing a motion other than translation, such as target 3 in C.5-C.14. Human subjects moving in the visual field often produced a large number of targets. For instance, three targets were detected in C.8-C.10 because the subject projection on the image was large and its more distant portions moved non-rigidly. Had the target been small, its motion would have yielded a unique target in spite of being non-rigid.

Artifacts in experiment D were caused by background compensation errors. They were observed as targets 8 and 9 in D.17. Sensitivity to occlusion in subtraction-based motion detection produced targets 47 and 48 in D.75.

Vergence control of the non-dominant camera successfully maintained the selected target within the field of view in spite of inaccurate correlation-based binocular registration. The quantitative analysis in Tables 3-6 is based on the metric described in equation (15). The intermediate results are expressed for those targets in each experiment that raised the system attention the most and were thus tracked for more frames.

                                               Target 1   Target 2   Target 3
Target detection (frames)                          63          7         22
Correct binocular registration (stereo pairs)      54          6         18
Target selected and tracked, Nt (frames)           60          4         19
Centroid within the test window
(dominant/non-dominant), Nc (frames)            47/42        4/3      17/13
Lock-on rate (dominant/non-dominant), P (%)     78/70     100/75      89/68

Table 3 - Quantitative analysis. Experiment A.


                                               Target 6   Target 8   Target 22
Target detection (frames)                          10         22          19
Correct binocular registration (stereo pairs)       7         18          17
Target selected and tracked, Nt (frames)            6         19          16
Centroid within the test window
(dominant/non-dominant), Nc (frames)              5/4       19/5         9/8
Lock-on rate (dominant/non-dominant), P (%)     83/67     100/26       56/50

Table 4 - Quantitative analysis. Experiment B.

                                               Target 3   Target 19   Target 28
Target detection (frames)                          16          12          15
Correct binocular registration (stereo pairs)       7           2           5
Target selected and tracked, Nt (frames)            8           4          10
Centroid within the test window
(dominant/non-dominant), Nc (frames)              0/0         0/0         0/0
Lock-on rate (dominant/non-dominant), P (%)       0/0         0/0         0/0

Table 5 - Quantitative analysis. Experiment C.

P was often smaller for the non-dominant image because of void disparities and the acceptance of incorrect ones. Large oscillations in the attentional feature estimates produced frequent attentional shifts, as can be seen in C.6-C.12. Fluctuations in the estimated location of a target centroid and the incorrect correspondence by the SSD-based method - caused by a similar texture - are seen in target 20 in D.29-D.30. Such occurrences negatively impact the tracking accuracy of targets undergoing non-rigid motion. The above supports the claim that, as implemented, the system is not able to consistently maintain within the test window a nearby human subject with its characteristic gait, which is a highly non-rigid motion. In experiment D an additional small target undergoing translation was tracked in a much smoother way than those targets that originated in the non-rigid motion of the human subject. To overcome this limitation, a higher image processing rate and a faster control cycle - encompassing command issuing, head motion control and sensor reading tasks, which ultimately integrate the host computer and the vision head drive - are required. The former encourages a biologically-inspired parallelization of the vision algorithms, whereas the latter depends on major changes being made to the drive hardware that presently controls the vision head. After such modifications, a target slipping away from the region of fixation would be expected to be detected before reaching a significant displacement. Furthermore, this would keep the correcting saccade amplitude small as well as improve the accuracy and robustness of both the correlation- and SSD-based methods.

                                                    Target 20   Target 21
Target detection (frames)                                8          14
Correct binocular registration (stereo pairs)            4           7
Target selected and tracked, Nt (frames)                 5           9
Centroid within the test window
(dominant/non-dominant), Nc (frames)                   1/1         6/5
Lock-on rate (dominant/non-dominant), P (%)          20/20       67/56

Table 6 - Quantitative analysis. Experiment D.

The performance of the attentional management was impacted by the quantity and quality of the information extracted by the vision algorithms. Figure 10 shows an instance of two targets competing for the system attention. Frequent attentional reorientations as observed in C.6-C.11 are shown in figure 11 along with the contribution of each term to the attention function in equation (11). Undesired reorientation occurred because of large variations in the estimates of attentional features caused by a low frame rate. The latter can be traced back to the emphasis put on the use of inexpensive off-the-shelf hardware, on the current sequential implementation of the system and on the vision head control cycle. Attentional management is expected to become more accurate as the image processing rate increases.
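Equation (11) is not reproduced in this excerpt, but figure 11 shows the terms that feed the interest value: density, number of moving pixels, velocity magnitude, a timer term and binocular disparity. A hedged sketch of an operator-tuned weighted sum over those features; the function form, names and weights below are illustrative, not the paper's actual values:

```python
def interest_value(density, n_moving_pixels, velocity, timer, disparity,
                   weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Hypothetical weighted-sum interest value over the attentional features
    shown in figure 11. The true structure of equation (11) and its
    operator-tuned weights are not given in this excerpt."""
    features = (density, n_moving_pixels, velocity, timer, disparity)
    return sum(w * f for w, f in zip(weights, features))

# The target with the largest interest value wins the attentional dispute.
targets = {21: interest_value(0.8, 120, 2.5, 4, 1.2),
           22: interest_value(0.6, 200, 1.0, 2, 0.9)}
winner = max(targets, key=targets.get)
```

With such a sum, a small interframe change in any one feature estimate can flip the winner, which is the mechanism behind the undesired reorientations discussed above.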

Figure 10 - Targets 21 and 22 in stereo pairs B.87-B.90 in a dispute for system attention.

Section 2 briefly mentioned the uncertainties that arise in target correspondence due to the aperture problem. The computation of either the SSD/SSP or the normalized-correlation criteria generates a surface of which the search area is its domain of minimization. The aperture problem is characterized by a lack of information which translates into a small surface curvature along a direction which is significantly aligned with the texture pattern. The analysis of the surface curvature is then useful to locate the directions along which uncertainty is either at its maximum or its minimum. Assuming that the surface generated by either criterion can be approximated by a quadratic f(x,y) in the neighborhood of the computed minimum at coordinates (xc,yc), the principal directions along which the curvature is at a maximum or a minimum are given by the eigenvectors of the Hessian:


Figure 11 - Targets 3 and 4 in stereo pairs C.6-C.11 in a dispute for the system attention. From top to bottom: a) Interest value. b) Density. c) Number of moving pixels. d) Velocity magnitude. e) Timer term. f) Binocular disparity.

H(x,y) = [ ∂²f(x,y)/∂x²    ∂²f(x,y)/∂x∂y ]
         [ ∂²f(x,y)/∂y∂x   ∂²f(x,y)/∂y²  ]   evaluated at (xc,yc)   (16)

The largest eigenvalue λ1 indicates the maximum curvature (smallest uncertainty) along the direction of its associated eigenvector u, and likewise the smallest eigenvalue λ2 indicates the minimum curvature (largest uncertainty) along the direction of its associated eigenvector v. Hence:

λ1(u·v) = (λ1u)·v = (Hu)·v = (Hu)^T v = u^T H^T v   (17)

where the dot represents the inner product operator. Because of the assumption of the quadratic approximation, f(.,.) is continuous and smooth and thus:

∂²f(x,y)/∂x∂y = ∂²f(x,y)/∂y∂x ⇒ H(x,y) = H^T(x,y)   (18)

and because H(.,.) is symmetric, from equation (17) the following holds:

u^T H v = u·(Hv) = u·(λ2v) = λ2(u·v)   (19)

Assuming further, without any loss of generality in the situations usually encountered, that λ1 ≠ 0, λ2 ≠ 0 and λ1 ≠ λ2, the subtraction of equations (17) and (19) yields:

(λ1 − λ2)(u·v) = 0 ⇒ u·v = 0



and thus the directions of maximum and minimum uncertainty are orthogonal under the above assumptions. The partial derivatives in equation (16) are approximated with discrete operators. Now consider, for instance, two consecutive dominant images on which two moving targets are detected as in figure 12. Target correspondence is required to estimate the velocity of each one.
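The curvature analysis above can be sketched numerically, approximating the partial derivatives of equation (16) with central differences on the sampled criterion surface. The surface here is a synthetic anisotropic quadratic standing in for an SSD/SSP surface, with low curvature along x mimicking the aperture problem (texture aligned with the x direction):

```python
import numpy as np

def hessian_at_minimum(surface, xc, yc):
    """Discrete Hessian of a sampled criterion surface at (xc, yc) using
    central differences, as suggested for equation (16)."""
    f = surface
    fxx = f[yc, xc + 1] - 2 * f[yc, xc] + f[yc, xc - 1]
    fyy = f[yc + 1, xc] - 2 * f[yc, xc] + f[yc - 1, xc]
    fxy = (f[yc + 1, xc + 1] - f[yc + 1, xc - 1]
           - f[yc - 1, xc + 1] + f[yc - 1, xc - 1]) / 4.0
    return np.array([[fxx, fxy], [fxy, fyy]])

# Synthetic surface with its minimum at the grid center (index 5, 5).
ys, xs = np.mgrid[-5:6, -5:6]
surface = 0.1 * xs**2 + 2.0 * ys**2
H = hessian_at_minimum(surface, 5, 5)
eigvals, eigvecs = np.linalg.eigh(H)
# Smallest eigenvalue -> direction of largest uncertainty (here, the x axis);
# the two eigenvectors are orthogonal, as derived from equations (17)-(19).
```

For this surface the eigenvalues come out as 0.2 and 4.0, so the eigenvector of the smaller one points along the weakly textured direction, exactly the direction where a correspondence estimate deserves the least confidence.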

Figure 12 - Consecutive images It-1 and It for uncertainty analysis.

Figure 13 shows the global minimum of the SSD/SSP surfaces generated for each possible correspondence with a '*' mark. Table 7 depicts the minima attained for each possibility. The search for the minimum furnishes the correct result up to fluctuations which occur in the computation of the centroid location.

The eigenvectors for the correct correspondences are in figure 14. By describing the interframe displacement according to the eigenvector base, a curvature-based confidence measure in each direction can be used to adaptively change the weights that multiply the velocity and disparity terms in the attention function (equation (11)) and thus reduce their importance when the available information is of poor quality. Moreover, the conjugation of such a confidence measure with the minimization procedure seems encouraging to disambiguate non-unique correspondence solutions.

Folk et allii (1992) claimed that the exogenous (involuntary) allocation of attention is not the result of relatively inflexible, "hard-wired" mechanisms triggered by specific stimulus properties. Such a type of allocation could be configured or set to respond selectively to a property that signaled the location of stimuli that were relevant to task performance. Such a configuration was called the "attentional control setting", an endogenous (voluntary) control factor analogous to the tuning of the attention function which drives the system behavior during tracking in the present work. The setting was assumed to be determined by current behavioral goals. Once this setting was established, events that exhibited the critical properties summoned attention involuntarily, whether or not the events were relevant to task performance. Stimuli not exhibiting such properties would not summon attention. Involuntary (undesired) shifts would thus be mediated by tunable attentional control settings which would vary according to current behavioral goals. In the computational model of attention

Figure 13 - Surfaces generated by equation (2.6.1) with computed minimum for each possibility of correspondence. From top to bottom: a) Target 0 in It with target 0 in It-1. b) 0 in It with 1 in It-1. c) 1 in It with 0 in It-1. d) 1 in It with 1 in It-1.




proposed by Koch & Ullman (1985), attention would be allocated to the location with the highest activation in a saliency map. Cave and Wolfe (1990) claimed that the strength of the local saliency should depend on bottom-up featural dissimilarity relative to a neighborhood and also on top-down influence dictated by current behavioral goals. The top-down influence should increase the salience signals at locations that contained task-relevant stimulus properties. The interrelation between endogenous attentional control settings (and their analogue in this work - the setting of the attention function to the conditions of the laboratory experiments with the purpose of producing an acceptable system behavior) and exogenous involuntary allocation of attention (which, similar to this work, shows as an apparently inadequate behavior, such as when attention is allocated to undesired locations which nevertheless present the expected stimuli) defines a dichotomy that possibly represents the delicate and yet efficient balance between the necessary rigidity to ensure that potentially important environmental events do not go unprocessed and the flexibility to adapt to changing circumstances and goals.

5 CONCLUSIONS

This paper describes the development and evaluation of a binocular vision tracking system augmented with an attentional management capability. It selects one among multiple targets according to the relative values achieved by an operator-tuned attention function. The main contribution is the augmentation of the monocular visual tracking concept as proposed in Murray and Basu (1994) with an attentional management capability and binocular fixation. Foveated images are used to reduce the computational workload inherent to detecting and segmenting multiple moving targets. The integration and experimental evaluation, both quantitative and qualitative, of various vision algorithms such as motion detection and target segmentation, SSD- and correlation-based registration along a sequence and across stereo pairs, attentional management and binocular fixation, all aiming at the extraction of information useful for controlling the head motion, are not often found in the literature. Recent related work is found in Uhlin (1996), Andersen (1996), Araújo et allii (1996ab), Batista et allii (1997), Eklund et allii (1995) and Molton et allii (1998). Each employs distinct hardware configurations and experimental setups, which makes a comparative performance evaluation rather complicated. For instance, the dedicated hardware employed in Uhlin (1996) and Andersen (1996) was essential to attain the reported 25 and 12 frames per second, respectively, whereas 0.8 stereo pair per second is reported here. The sequential implementation described here is a consequence of the emphasis put on the use of inexpensive off-the-shelf PC-based equipment. Eventually, such an option imposed limitations in terms of frame rate and the ensuing target motion that could be adequately dealt with by the system.

Target in It   Target in It-1   SSD/SSP
     0               0           0.0016
     0               1           0.1195
     1               0           0.2756
     1               1           0.0262

Table 7 - Minimum values of the SSD/SSP for each possible correspondence.
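Selecting the correspondences from Table 7 amounts to taking, for each target in It, the candidate in It-1 whose SSD/SSP minimum is smallest. A minimal sketch with the table's values:

```python
# SSD/SSP minima from Table 7, keyed by (target in It, target in It-1).
ssd = {(0, 0): 0.0016, (0, 1): 0.1195,
       (1, 0): 0.2756, (1, 1): 0.0262}

def best_match(target_t, candidates=(0, 1)):
    """Correspondence for a target in It: the candidate in It-1 that
    attains the smallest SSD/SSP minimum."""
    return min(candidates, key=lambda c: ssd[(target_t, c)])

matches = {t: best_match(t) for t in (0, 1)}  # -> {0: 0, 1: 1}
```

Both targets are matched correctly here, consistent with the text's observation that the minimum search furnishes the correct result up to centroid-location fluctuations.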

System performance is acceptable when coping with small targets undergoing translation but suffers degradation when large objects undergoing non-rigid motion enter the visual field, such as the gait of human subjects moving at close range. The attentional management as proposed here displays undesired attentional reorientations due to interframe changes in the attentional features, which are caused by a combination of varying illumination, scaling factor changes and rotation. Such changes cause a lock onto a different target whose features appear more similar to those of the target previously selected. The compromise between an excessive specialization by fine-tuning the attention function and an adequate flexibility to provide for the processing of new events that present a potential to raise the system interest is discussed. The analysis of the uncertainties in the estimation of attentional features and the potential of adaptively learning the attention function - either the weights in equation (11) or, in an even broader sense, its structure - as goals change and perceptions of the environment are gathered seem to combine in a fertile field of research on mechanisms that aim at providing a robotic system having an active vision capability with true autonomy to manage its limited computational resources.

Acknowledgements:

The authors wish to acknowledge the support of Dr. Antonio Francisco Jr., the partnership with the Instituto Nacional de Pesquisas Espaciais (INPE) and the funds provided by FAPESP to the joint ITA/INPE research efforts conducted at the Active Vision and Perception Laboratory (FAPESP project no. 0620/95).

6 REFERENCES

Ahuja,N. and Abbott,A.L.(1993). Active Stereo: Integrating Disparity, Vergence, Focus, Aperture and Calibration for Surface Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.15, no.10, 1007-1029.

Figure 14 - Directions of maximum and minimum uncertainties. a) Target 0 in It with target 0 in It-1. b) 1 in It with 1 in It-1.



Andersen,C.S.(1996). A Framework for Control of a Camera Head. Ph.D. dissertation, Laboratory of Image Analysis, Institute of Electronic Systems, Aalborg University.

Araújo,H.;Batista,J.;Peixoto,P. and Dias,J.(1996). Gaze Control in a Binocular Active Vision System Using Optical Flow. Proceedings of the 4th International Symposium on Intelligent Robotics Systems.

Araújo,H.;Batista,J.;Peixoto,P. and Dias,J.(1996). Pursuit Control in a Binocular Active Vision System Using Optical Flow. Proceedings of the 13th International Conference on Pattern Recognition, 313-317.

Balkenius,C. and Kopp,L.(1996). Visual Tracking and Target Selection for Mobile Robots. IEEE Proceedings of EUROBOT'96, 166-171.

Bandera,C.;Vico,F.J.;Bravo,J.M.;Harmon,M.E. and Baird III,L.C.(1996). Residual Q-Learning Applied to Visual Attention. Proceedings of the 13th International Conference on Machine Learning, 20-27.

Bar-Shalom,Y. and Li,X.-R.(1993). Estimation and Tracking: Principles, Techniques and Software, Artech House.

Batista,J.;Peixoto,P. and Araújo,H.(1998). Real-Time Active Visual Surveillance by Integrating Peripheral Motion Detection with Foveated Tracking. Proceedings of the 1998 IEEE Workshop on Visual Surveillance, 18-25.

Batista,J.;Peixoto,P. and Araújo,H.(1997). Visual Behaviors for Real-Time Control of a Binocular Active Vision System. IFAC Algorithms and Architectures for Real Time Control AARTC'97.

Bispo,E.M. and Waldmann,J.(1998). Saccadic Motion Control for Monocular Fixation in a Robotic Vision Head: A Comparative Study, Journal of the Brazilian Computer Society - Special Issue on Robotics, vol.3, no.4, 61-69.

Cave,K.R. and Wolfe,J.M.(1990). Modelling the Role of Parallel Processing in Visual Search. Cognitive Psychology, no.22, 225-271.

Cowie,R. and Taylor,J.(1997). The Moving Eye: A Model for the 21st Century Sensor? Proceedings of the 13th International Conference on Digital Signal Processing, 255-260.

Crisman,J.D.;Cleary,M.E. and Rojas,J.C.(1998). The Deictically Controlled Wheelchair. Image and Vision Computing, vol.16, 235-249.

Culhane,S.M. and Tsotsos,J.K.(1992). A Prototype for Data-Driven Visual Attention. Proceedings of the 11th International Congress on Pattern Recognition, 36-40.

Dagless,E.L.;Ali,A.T. and Cruz,J.B.(1993). Visual Road Traffic Monitoring and Data Collection. Proceedings of the IEEE-IEE Vehicle Navigation and Information Systems Conference, 146-149.

Eklund,M.W.;Ravichandran,G.;Trivedi,M.M. and Marapane,S.B.(1995). Adaptive Visual Tracking Algorithm and Real-Time Implementation. IEEE International Conference on Robotics and Automation, 2657-2662.

Fairhurst,M.C.(1988). Computer Vision for Robotic Systems: An Introduction. Prentice Hall.

Folk,C.L.;Remington,R.W. and Johnston,J.C.(1992). Involuntary Covert Orienting Is Contingent on Attentional Control Settings, Journal of Experimental Psychology: Human Perception and Performance, vol.18, no.4, 1030-1044.

Hager,G.D. and Belhumeur,P.N.(1996). Real-Time Tracking of Image Regions with Changes in Geometry and Illumination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 403-410.

Huber,E. and Kortenkamp,D.(1995). Using Stereo Vision to Pursue Moving Agents with a Mobile Robot. IEEE International Conference on Robotics and Automation, 2340-2346.

Itti,L.;Koch,C. and Niebur,E.(1998). A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.20, no.11, 1254-1259.

Koch,C. and Ullman,S.(1985). Shifts in Selective Attention: Toward the Underlying Neural Circuitry. Human Neurobiology, vol.4, 219-227.

Molton,N.;Se,S.;Brady,J.M.;Lee,D. and Probert,P.(1998). A Stereo Vision-Based Aid for the Visually Impaired, Image and Vision Computing, vol.16, 251-263.

Murray,D. and Basu,A.(1994). Motion Tracking with an Active Camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.16, no.5, 449-459.

Rotstein,H. and Rivlin,E.(1996). Control of a Camera for Active Vision: Foveated Vision, Smooth Tracking and Saccade. Proceedings of the 1996 IEEE International Conference on Control Applications, 691-696.

Scott,P.D. and Bandera,C.(1990). Hierarchical Multiresolution Data Structures and Algorithms for Foveal Vision Systems. IEEE International Conference on Systems, Man and Cybernetics, 832-834.

Shi,J. and Tomasi,C.(1994). Good Features to Track. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 593-600.

Sun,C.(1997). A Fast Stereo Matching Method. Digital Image Computing: Techniques and Applications, New Zealand, 95-100.

Uhlin,T.(1996). Fixation and Seeing Systems. Ph.D. dissertation, ISRN KTH/NA/P--96/10--SE ISSN 1101-2250:TRITA-NA-P96/10, Department of Numerical Analysis and Computing Science, Stockholm University.

Viana,S.A.A.;Waldmann,J. and Caetano,F.F.(1999). Non-Linear Optimization-Based Batch Calibration with Accuracy Evaluation, Revista Controle & Automação - Special Edition on Computational Vision, vol.10, no.2, 89-99.

Weiman,C.F.R. and Vincze,M.(1996). A Generic Motion Platform for Active Vision. SPIE International Symposium on Intelligent Systems and Advanced Manufacturing, Robotics and Intelligent Systems, USA.



Appendix A: Algorithm for clustering classes into superclasses and target formation

For all i,j; i = 1,...,n and j > i:
{ if( C(ci,cj) ) then
    if( ∃ k < i | ci ⊂ Sck ) then
    { Sck = Sck ∪ cj ; /* merges cj with the superclass to which ci belongs */
      if( ∃ m < i; m ≠ k | cj ⊂ Scm ) then
      { Sck = Sck ∪ Scm; Scm = {∅}; } /* merges the superclasses containing both classes */
    }
    else if( ∃ k < i | cj ⊂ Sck ) then
      Sck = Sck ∪ ci ; /* merges ci with the superclass to which cj belongs */
    else if( Sci = {∅} ) then Sci = ci ∪ cj ; /* initiates a superclass */
    else Sci = Sci ∪ cj ; /* merger with cj */
  else if( j = n and Sci = {∅} ) then Sci = ci ; /* superclass with an isolated class */
}
n = 0; /* target count */
∀ Sck ≠ {∅}, k = 1,...,n:
{ n = n + 1; /* n-th target: moving pixels, density, centroid location */

  #pix(n) = Σ_{ci ⊂ Sck} #pix(ci) ;

  dens(n) = [ Σ_{ci ⊂ Sck} dens(ci)·#pix(ci) ] / #pix(n) ;

  c_n = [ Σ_{ci ⊂ Sck} c_i·#pix(ci) ] / #pix(n) ;
}
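The appendix algorithm can be sketched in Python. This is a hedged reading of the pseudocode in which classes are grouped by the connectivity predicate C and each resulting target accumulates its moving-pixel count, pixel-weighted density and pixel-weighted centroid; a union-find structure replaces the pseudocode's explicit superclass set unions, and the dictionary field names are illustrative, not the paper's data structures:

```python
def cluster_targets(classes, connected):
    """Cluster classes into superclasses using connected(ci, cj), then
    compute per-target statistics as in Appendix A. Each class is a dict
    with 'pix' (moving-pixel count), 'dens' (density) and 'c' (centroid)."""
    n = len(classes)
    parent = list(range(n))          # union-find over class indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if connected(classes[i], classes[j]):
                parent[find(i)] = find(j)   # merge the two superclasses

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(classes[i])

    targets = []
    for members in groups.values():
        pix = sum(c['pix'] for c in members)
        dens = sum(c['dens'] * c['pix'] for c in members) / pix
        cx = sum(c['c'][0] * c['pix'] for c in members) / pix
        cy = sum(c['c'][1] * c['pix'] for c in members) / pix
        targets.append({'pix': pix, 'dens': dens, 'c': (cx, cy)})
    return targets
```

For example, with two nearby classes and one distant one, the nearby pair merges into a single target whose density and centroid are the pixel-weighted means of its members.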

Appendix B: Excerpts from the experiments.

The following sequences of stereo pairs are displayed as non-dominant and dominant images (left and right, respectively) from top to bottom.

B.87 to B.91.



C.5-C.14.

D.16-D.17



D.26-D.34

D.74-D.78