Digital Object Identifier (DOI) 10.1007/s00530-004-0146-3
Multimedia Systems 10: 131–143 (2004)

Motion analysis for event detection and tracking with a mobile omnidirectional camera

Tarak Gandhi, Mohan M. Trivedi

Computer Vision and Robotics Research Laboratory, University of California – San Diego, USA (e-mail: {tgandhi, mtrivedi}@ucsd.edu)

Published online: 11 October 2004 – © Springer-Verlag 2004

Abstract. A mobile platform mounted with an omnidirectional vision sensor (ODVS) can be used to monitor large areas and detect interesting events such as independently moving persons and vehicles. To avoid false alarms due to extraneous features, the image motion induced by the moving platform should be compensated. This paper describes a formulation and application of parametric egomotion compensation for an ODVS. Omni images give a 360° view of the surroundings but undergo considerable image distortion. To account for these distortions, the parametric planar motion model is integrated with the transformations into omni image space. Prior knowledge of approximate camera calibration and camera speed is integrated with the estimation process using a Bayesian approach. Iterative, coarse-to-fine, gradient-based estimation is used to correct the motion parameters for vibrations and other inaccuracies in the prior knowledge. Experiments with a camera mounted on various types of mobile platforms demonstrate successful detection of moving persons and vehicles.

Keywords: Motion detection – Optical flow – Panoramic vision – Dynamic vision – Mobile robots – Intruder detection – Surveillance

1 Introduction and motivation

Computer vision researchers have long recognized the importance of visual-surveillance-related applications while pursuing some of the outstanding research issues in dynamic scene analysis, motion detection, feature extraction, pattern and activity analysis, and biometric systems. Recent world events demand practical and robust deployment of video-based solutions for a wide range of applications [11,16,30,33]. Such wider acceptance of the need for the technology does not mean that these systems are indeed ready for deployment. There are many important and difficult research problems that remain to be solved. In this paper we present a study focused on one such challenging research problem, that of developing an autonomous system that can serve as a "mobile sentry" to perform the tasks assigned to someone posted on guard duty around the perimeter of a base. The mobile sentry with video cameras should be able to detect "interesting" events and record and report the nature and location of each event in real time for further processing by a human operator. This is an ambitious goal, and it requires the resolution of several important problems from computer vision and intelligent robotics. In this paper we focus on the problem of detecting and compensating for the egomotion of a mobile platform. One of the novel features of our research is the use of omnidirectional video streams as the input to the vision system.

When a camera is stationary, background subtraction is often used to extract moving objects [20,39]. However, when the camera is moving, the background also undergoes egomotion, which should be compensated. To distinguish objects of interest from extraneous features on the ground, the ground is usually approximated by a planar surface whose egomotion is modeled using a projective transform [12,29] or its linearized version. Using this model, the egomotion of the ground can be compensated to separate objects with independent motion or height. This approach has been widely used for object detection from moving platforms [6,13,29].

Omnidirectional vision sensors (ODVS), or omnicameras, that give a 360° field of view of the surroundings are very useful for applications such as surveillance [5,8], robot navigation [42], localization [26], and wide-baseline stereo [38,43]. The book by Benosman [3] gives a comprehensive review of the theory and applications of omnicameras. Motion estimation from moving ODVS cameras has recently been a topic of great interest. Rectilinear cameras usually have a smaller field of view, which often causes the focus of expansion to lie outside the image, making motion estimation sensitive to camera orientation. Also, the motion field produced by translation along the horizontal direction is similar to that from rotation about a vertical axis. As noted by Gluckman and Nayar [18], ODVS cameras avoid both these problems thanks to their wide field of view. They project the image motion onto a spherical surface using Jacobians of the transformations to determine the egomotion of a moving platform in terms of translation and rotation of the camera. Vassallo et al. [41] use the same spherical projection for egomotion estimation.


Experiments are performed using robotic platforms in an indoor environment, and the egomotion estimates are compared with those from odometry. Shakernia et al. [35] use the concept of back-projection flow, where the image motion is projected onto a virtual curved surface instead of a spherical surface, which makes the Jacobians of the transformation simpler. Using this concept, they have adapted egomotion algorithms for use with ODVS sensors. Results using simulated image sequences show the basic feasibility of the approach.

In our own research, the emphasis is on robustness, efficiency, and applicability in the outdoor environments encountered in surveillance and physical or base security. The main contribution of this paper is the detection of events, such as independently moving persons and automobiles, from ODVS video sequences obtained from a moving platform for surveillance applications.

2 Egomotion compensation framework for ODVS video

Parametric motion estimation based on image gradients, also known as the "direct method", has been used with rectilinear cameras for planar motion estimation, obstacle detection, and motion segmentation [24,28]. The advantage of direct methods is that they can use motion information not only from cornerlike features but also from edges, which are usually more numerous in an image. On the other hand, direct methods are more challenging to implement, especially for outlier removal, and it is more difficult to track features over frames. A comparison of the corner-based and direct gradient-based methods is given in Table 1.

Here, the concept of the direct method is extended for use with an ODVS. This approach was also used for detecting surrounding vehicles from a moving car in [15,23]. An ODVS gives a full 360° view of the surroundings, which reduces the motion ambiguities often present in rectilinear cameras. However, the images undergo considerable distortion, which should be accounted for during motion estimation.

The block diagram of the event detection system is shown in Fig. 1. The initial estimates of the ground plane motion parameters are obtained using approximate knowledge about the camera calibration and speed. Using these parameters, one of the frames is warped toward another frame to compensate the motion of the ground plane. However, the motion of features having independent motion or height above the ground plane is not fully compensated. To detect these features, the normalized image difference between the two images is computed using temporal and spatial gradients. This suppresses the features on the ground plane and enhances the objects of interest. Morphological and other postprocessing is performed to further suppress the ground features resulting from any residual motion and to get the positions of the objects. The detected objects are then tracked over frames to form events.

Fig. 1. Event detection and recording system based on egomotion compensation from a moving platform: block diagram

However, the calibration and speed of the camera may not be known accurately. Furthermore, the camera may vibrate during the motion. For these reasons, the motion of the ground plane may not be fully compensated, leading to misses and false alarms. In order to improve the performance, the motion parameters are iteratively corrected using the spatial and temporal gradients of the motion-compensated images with the optical flow constraint in a coarse-to-fine framework. The motion information contained in these gradients is optimally combined with the prior knowledge of the motion parameters using a Bayesian framework similar to [29]. Robust estimation is used to separate the ground plane features from other features. The following sections deal with the individual blocks described above, along with the appropriate formulation for the ODVS.

3 ODVS motion transformations

To compensate the motion of the omnidirectional camera, the transformation due to ODVS should be combined with that due to motion. These transforms are discussed below.

Table 1. Comparison between corner-based and gradient-based motion estimation

Corner-based methods                      | Gradient-based (direct) methods
Determines motion of individual features  | Fits motion model to entire or part of scene
Only cornerlike features used             | Edge features used (more numerous)
Easier to track over frames               | More difficult to track over frames
Easier to identify outliers               | Outlier removal more difficult
Easier to implement                       | More difficult to implement


Fig. 2. a Omnidirectional vision sensor (ODVS). b A typical image from an ODVS. c Transformation to a perspective plan view

3.1 Flat-plane transformation

The ODVS used in this work consists of a hyperbolic mirror and a camera placed on its axis. It belongs to a class of cameras known as central panoramic catadioptric cameras [3]. These cameras have a single viewpoint that allows the image to be suitably transformed to obtain perspective views. Figure 2a shows a photograph of an ODVS mirror. An image from a camera mounted with an ODVS mirror is shown in Fig. 2b. It is seen that the camera covers a 360° field of view around its center. However, the image it produces is distorted, with straight lines transformed into curves. A flat-plane transformation is applied to the image to produce a perspective view looking downwards, as shown in Fig. 2c, where the distortion is considerably reduced. Details of this transformation are discussed below.

The geometry of a hyperbolic ODVS is shown in Fig. 3. According to the mirror geometry, a light ray from the object toward the viewpoint at the first focus O is reflected so that it passes through the second focus, where a conventional rectilinear camera is placed. The equation of the hyperboloid is given by

\frac{(Z - c)^2}{a^2} - \frac{X^2 + Y^2}{b^2} = 1 ,

where c = \sqrt{a^2 + b^2}.

Fig. 3. Omnidirectional camera geometry

Let P = (X, Y, Z)^T denote the homogeneous coordinates of the perspective transform of any 3D point λP on ray OP, where λ is the scale factor depending on the distance of the 3D point from the origin. It can be shown [1,22,35] that the reflection in the mirror gives the point −p = (−x, −y)^T on the image plane of the camera using the flat-plane transform F:

F(P) = p = \begin{pmatrix} x \\ y \end{pmatrix} = \frac{q_1}{q_2 Z + q_3 \|P\|} \begin{pmatrix} X \\ Y \end{pmatrix} , \qquad (1)

where

q_1 = c^2 - a^2 , \quad q_2 = c^2 + a^2 , \quad q_3 = 2ac , \quad \|P\| = \sqrt{X^2 + Y^2 + Z^2} .

Note that the expression for the image coordinates p is independent of the scale factor λ. The pixel coordinates w = (u, v)^T are then obtained by using the calibration matrix K of the conventional camera, composed of the focal lengths f_u, f_v, optical center coordinates (u_0, v_0)^T, and camera skew s:

\begin{pmatrix} w \\ 1 \end{pmatrix} = K \begin{pmatrix} p \\ 1 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} . \qquad (2)

This transform can be used to warp an omni image to a plan perspective view. To convert a perspective view back to an omni view, the inverse flat-plane transform can be used:

\begin{pmatrix} p \\ 1 \end{pmatrix} = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} , \qquad (3)

F^{-1}(p) = P = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} q_1 x \\ q_1 y \\ q_2 - q_3 \sqrt{x^2 + y^2 + 1} \end{pmatrix} . \qquad (4)

It should be noted that the transformation from the omni to the perspective view involves very different magnifications in different parts of the image. For this reason, the quality of the image deteriorates if the entire image is transformed at one time. Hence, it is desirable to perform motion estimation directly in the ODVS domain but use the above transformations to map the locations to the perspective domain as required.
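As a concrete illustration of Eqs. 1-4, the short Python/NumPy sketch below implements the flat-plane transform F, its inverse, and the pixel mapping through K; the mirror parameters a, b and the calibration values are made-up placeholders, not those of the sensor used in the paper:

import numpy as np

# Hyperbolic mirror parameters (placeholder values, not the paper's sensor).
a, b = 0.03, 0.04
c = np.sqrt(a**2 + b**2)
q1, q2, q3 = c**2 - a**2, c**2 + a**2, 2 * a * c

# Conventional-camera calibration matrix K (assumed values).
K = np.array([[400.0,   0.0, 320.0],
              [  0.0, 400.0, 240.0],
              [  0.0,   0.0,   1.0]])

def flat_plane(P):
    """Eq. 1: mirror-frame point P = (X, Y, Z) -> image point p = (x, y)."""
    X, Y, Z = P
    return q1 * np.array([X, Y]) / (q2 * Z + q3 * np.linalg.norm(P))

def flat_plane_inv(p):
    """Eq. 4: image point p -> a point on the viewing ray (defined up to scale)."""
    x, y = p
    return np.array([q1 * x, q1 * y, q2 - q3 * np.sqrt(x**2 + y**2 + 1.0)])

def to_pixel(p):
    """Eq. 2: image point -> pixel coordinates w = (u, v) through K."""
    u, v, s = K @ np.array([p[0], p[1], 1.0])
    return np.array([u / s, v / s])

def from_pixel(w):
    """Eq. 3: pixel coordinates -> image point through K^{-1}."""
    x, y, s = np.linalg.solve(K, np.array([w[0], w[1], 1.0]))
    return np.array([x / s, y / s])

# Consistency check: F composed with its inverse reproduces the image point.
p = np.array([0.3, -0.2])
assert np.allclose(flat_plane(flat_plane_inv(p)), p)
assert np.allclose(from_pixel(to_pixel(p)), p)

Because F ignores the scale of P, the round trip F(F^{-1}(p)) = p holds exactly, which is what makes the point-wise warping described next well defined.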

3.2 Planar motion transformation

To detect objects with motion or height, the motion of the ground is modeled using a planar motion model [12,27]. Let P_A and P_B denote the perspective transforms of a point on the ground plane in the homogeneous coordinate systems corresponding to two positions A and B of the moving camera. These are related by

\lambda_B P_B = \lambda_A R P_A + D = \lambda_A \left[ R P_A + D / \lambda_A \right] , \qquad (5)


Fig. 4. Transforming a pixel from omni image A to omni image B using (1) inverse calibration matrix K^{-1}, (2) inverse flat-plane transform F^{-1}, (3) projective transform H for planar motion from A to B, (4) flat-plane transform F, (5) calibration matrix K

where R and D denote the rotation and translation between the camera positions and λ_A, λ_B depend on the distance of the actual 3D point. Let the ground plane satisfy the following equation at the camera position A:

\lambda_A K^T P_A = 1 \quad \text{or} \quad 1 / \lambda_A = K^T P_A .

Substituting the value of 1/λ_A into Eq. 5, it is seen that P_A and P_B are related by a projective transform:

\lambda_B P_B = \lambda_A \left[ R + D K^T \right] P_A = \lambda_A H P_A \qquad (6)

or P_B ≡ H P_A within a scale factor. This relation has been widely used to estimate planar motion for perspective cameras.

For performing motion compensation using omnidirectional cameras, the above projective transform should be combined with the flat-plane transform as well as the camera calibration matrix to warp every point in one image toward another. The complete transform for warping is shown in Fig. 4.
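To make the warping chain of Fig. 4 concrete, the sketch below composes the five steps for a single pixel of omni image A; the mirror constants, the calibration matrix, and the homography H are illustrative assumptions, not calibrated values from the paper:

import numpy as np

a, c = 0.03, 0.05                       # mirror parameters (placeholders)
q1, q2, q3 = c**2 - a**2, c**2 + a**2, 2 * a * c
K = np.array([[400.0,   0.0, 320.0],    # assumed calibration matrix
              [  0.0, 400.0, 240.0],
              [  0.0,   0.0,   1.0]])
H = np.array([[ 1.0 , 0.01, 0.002],     # assumed ground-plane homography (Eq. 6)
              [-0.01, 1.0 , 0.001],
              [ 0.0 , 0.0 , 1.0  ]])

def warp_omni_pixel(w_A):
    """Map pixel (u, v) of omni image A to its location in omni image B (Fig. 4)."""
    u, v = w_A
    x, y, s = np.linalg.solve(K, np.array([u, v, 1.0]))            # (1) K^{-1}
    p_A = np.array([x / s, y / s])
    P_A = np.array([q1 * p_A[0], q1 * p_A[1],                      # (2) F^{-1}
                    q2 - q3 * np.sqrt(p_A @ p_A + 1.0)])
    P_B = H @ P_A                                                  # (3) planar motion H
    p_B = q1 * P_B[:2] / (q2 * P_B[2] + q3 * np.linalg.norm(P_B))  # (4) F
    u_B, v_B, w = K @ np.array([p_B[0], p_B[1], 1.0])              # (5) K
    return np.array([u_B / w, v_B / w])

print(warp_omni_pixel((350.0, 260.0)))  # small displacement for a near-identity H

Warping a whole frame amounts to applying this mapping to every pixel, in practice with bilinear interpolation of the intensities.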

4 Parametric motion estimation for ODVS

This section describes the main contribution of the paper. Direct methods based on image gradients have been applied for estimating the motion parameters for rectilinear cameras [4,24]. Here, the direct method is generalized for ODVS cameras. Information from image gradients is combined with the a priori known information about the camera motion and calibration in a Bayesian framework to obtain optimal estimates of motion parameters for egomotion compensation.

4.1 Use of optical flow constraint

Under favorable conditions, the spatial gradients (g_u, g_v), the temporal gradient g_t, and the residual image motion (∆u, ∆v)^T after current motion compensation satisfy the optical flow constraint [21]:

g_u \Delta u + g_v \Delta v + g_t = 0 . \qquad (7)

Fig. 5. Aperture problem. a In the case of an edge, only the component of motion normal to the edge can be determined. b In the case of a corner, the aperture problem is avoided, and the motion can be uniquely determined

However, there is only one equation between two unknowns for each point. For this reason, only the normal flow, i.e., the flow in the direction of the gradient, can be determined using a single point. This is known as the aperture problem and is illustrated in Fig. 5a. To solve this problem, Lucas and Kanade [31] assumed that the image motion is approximately constant in a small window around every point. Using this constraint, more equations are obtained using the neighboring points, and the full optical flow can be estimated using least squares. Such an estimate is reliable near cornerlike points, where a window has gradients in different directions, as seen in Fig. 5b. This method has been used by Shi and Tomasi [36] to find and track cornerlike features over an image. However, in the case of an ODVS the assumption of uniform optical flow needs to be modified due to the nonlinear ODVS transform. Daniilidis [9] has generalized the optical flow estimation to ODVS cameras.

This approach would use the motion information only at cornerlike features. However, edge features also carry motion information. To use this information, the image gradients can be used directly to estimate the model parameters. This approach is known in the literature as the direct method of motion estimation and has been extensively used in obstacle detection with rectilinear cameras [4,24]. Usually a linearized version of a projective transform is used:

\Delta u = a_1 u + a_2 v + a_3 + a_7 u^2 + a_8 u v ,
\Delta v = a_4 u + a_5 v + a_6 + a_7 u v + a_8 v^2 .

These expressions for the image motion are substituted into the optical flow constraint in Eq. 7 to give

g_u (a_1 u + a_2 v + a_3 + \ldots) + g_v (a_4 u + a_5 v + a_6 + \ldots) + g_t = 0 .

This gives, for every point, one equation in the eight parameters, which can be solved using linear least squares. Since the quadratic parameters are more sensitive to noise, a six-parameter affine model is also used.
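A minimal NumPy sketch of this direct least-squares step is shown below; the array names and the synthetic test data are assumptions made for illustration:

import numpy as np

def estimate_parametric_motion(gu, gv, gt, u, v):
    """Solve for (a1 ... a8) from one optical-flow constraint per pixel."""
    A = np.column_stack([
        gu * u, gu * v, gu,                 # a1, a2, a3
        gv * u, gv * v, gv,                 # a4, a5, a6
        gu * u**2 + gv * u * v,             # a7
        gu * u * v + gv * v**2,             # a8
    ])
    params, *_ = np.linalg.lstsq(A, -gt, rcond=None)
    return params

# Synthetic check: temporal gradients generated from known parameters.
rng = np.random.default_rng(0)
u, v = rng.uniform(-1, 1, 500), rng.uniform(-1, 1, 500)
gu, gv = rng.normal(size=500), rng.normal(size=500)
a = np.array([0.01, -0.02, 0.5, 0.03, 0.01, -0.4, 0.001, -0.002])
du = a[0]*u + a[1]*v + a[2] + a[6]*u**2 + a[7]*u*v
dv = a[3]*u + a[4]*v + a[5] + a[6]*u*v + a[7]*v**2
gt = -(gu * du + gv * dv)                   # exact constraint, no noise
assert np.allclose(estimate_parametric_motion(gu, gv, gt, u, v), a)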

4.2 Generalization for ODVS

To apply the motion estimation to ODVS cameras, the nonlinear flat-plane transform is used to go from the omni to the perspective domain and back. Since nonlinearity has to be dealt with anyway, the projective transform H is used instead of a linear model so that large motions can be handled better. The motion parameters in the projective transform are parameterized as

h = \begin{pmatrix} h_1 & h_2 & h_3 & h_4 & h_5 & h_6 & h_7 & h_8 \end{pmatrix}^T

with

\frac{H}{H_{33}} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{pmatrix} .

The optical flow constraint equation is satisfied only for small image displacements up to 1 or 2 pixels. To estimate larger motions, a coarse-to-fine pyramidal framework [25,37] is used. In this framework, a multiresolution Gaussian pyramid is constructed for adjacent images in the sequence. The motion parameters are first computed at the coarsest level, and the image points at the next finer level are warped using the computed motion parameters. The residual motion is computed at the finer level, and the process is repeated until the finest level is reached. Even within each level, multiple iterations of warping and estimation can be performed.
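The skeleton below illustrates this coarse-to-fine loop using OpenCV Gaussian pyramids and, for brevity, a plain pixel homography; in the ODVS case the per-level warp would be the Fig. 4 chain with the calibration rescaled to each pyramid level. The refine callback and the scaling matrix S are assumptions of this sketch:

import cv2
import numpy as np

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr[::-1]                        # coarsest level first

def coarse_to_fine(img_a, img_b, refine, levels=4, iters_per_level=3):
    """refine(a, b, H) -> updated 3x3 H at the current resolution (user supplied)."""
    pyr_a = gaussian_pyramid(img_a, levels)
    pyr_b = gaussian_pyramid(img_b, levels)
    H = np.eye(3)
    S = np.diag([2.0, 2.0, 1.0])            # pixel scaling between adjacent levels
    for lvl, (a, b) in enumerate(zip(pyr_a, pyr_b)):
        for _ in range(iters_per_level):
            H = refine(a, b, H)             # warp-and-estimate at this level
        if lvl < levels - 1:
            H = S @ H @ np.linalg.inv(S)    # propagate to the next finer level
    return H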

Let h be the actual value of the motion parameter vector and ĥ the current estimate. Using the current estimate, the second image B is warped toward the first image A to get the warped image B′. Then, the transformation between A and B′ can be expressed approximately in terms of ∆h = h − ĥ. Let w_A = (u_A, v_A)^T be the projection of a point on the planar surface in image A. Then, the projection w_{B′} of the same point in the warped image B′ is a function of w_A as well as ∆h, given using a composition of operations shown in Fig. 4. The optical flow constraint between images A and B′ is then given by

\begin{pmatrix} g_u & g_v \end{pmatrix} \left[ w_{B'} - w_A \right] = -g_t + \eta ,

where η accounts for the random noise in the temporal image gradient. For N points on the planar surface, the constraints can be expressed in matrix form:

\Delta z = c(\Delta h) + v ,

where every row i of the equation represents the constraint for a single image point, with

c_i(\Delta h) = \begin{pmatrix} g_u & g_v \end{pmatrix}_i \left[ w_{B'}(w_A; \Delta h) - w_A \right] , \quad \Delta z_i = -(g_t)_i , \quad v_i = \eta_i . \qquad (8)

Due to the flat-plane and the projective transforms, the function c(·) is nonlinear. Hence, the state estimate ĥ and its covariance P are iteratively updated using the measurement update equations of the iterated extended Kalman filter [2], with C denoting the Jacobian matrix of c(·):

P \leftarrow \left[ \gamma C^T R^{-1} C + P_-^{-1} \right]^{-1} , \qquad (9)

\hat{h} \leftarrow \hat{h} + \Delta h = \hat{h} + P \left[ \gamma C^T R^{-1} \Delta z - P_-^{-1} (\hat{h} - h_-) \right] , \qquad (10)

where R is the covariance of the temporal gradient measurements, h_− is the prior value of the state obtained from camera calibration and velocity, and P_− is the prior covariance. The matrix R is taken as a diagonal matrix to simplify calculations. However, this would mean assuming that the pixel gradients are independent, which may not really be the case, since gradients are computed from multiple pixels. Hence, the factor γ ≤ 1 is used to accommodate the interdependence of the gradient measurements.
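The sketch below carries out one such measurement update (Eqs. 9 and 10) for a diagonal R; the variable names are assumptions, and the Jacobian C is assumed to have been computed as described next:

import numpy as np

def bayesian_update(h_hat, h_prior, P_prior, C, dz, R_diag, gamma=0.5):
    """One Bayesian correction of the motion parameters.

    h_hat   : current estimate of the eight motion parameters
    h_prior : prior parameters from calibration and platform speed
    P_prior : prior covariance (8 x 8)
    C       : Jacobian of the constraint function c(.) (N x 8)
    dz      : measurement residuals, -g_t per point (N,)
    R_diag  : per-point variance of the temporal gradients (N,)
    gamma   : factor <= 1 discounting correlated gradient measurements
    """
    Rinv_C = C / R_diag[:, None]                    # R^{-1} C for diagonal R
    P_prior_inv = np.linalg.inv(P_prior)
    P = np.linalg.inv(gamma * C.T @ Rinv_C + P_prior_inv)               # Eq. 9
    dh = P @ (gamma * Rinv_C.T @ dz - P_prior_inv @ (h_hat - h_prior))  # Eq. 10
    return h_hat + dh, P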

To compute the Jacobian C, each row C_i is expressed using the chain rule:

C_i = \frac{\partial c_i}{\partial h} = \left( \frac{\partial c}{\partial w_B} \frac{\partial w_B}{\partial p_B} \frac{\partial p_B}{\partial P_B} \frac{\partial P_B}{\partial h} \right)_i , \qquad (11)

where P_B = (X_B, Y_B, Z_B)^T, p_B = (x_B, y_B)^T, and w_B = (u_B, v_B)^T are, respectively, the coordinates of point i in the mirror, image, and pixel coordinate systems for camera position B.

Differentiating Eq. 8 w.r.t. w_B gives

\left( \frac{\partial c}{\partial w_B} \right)_i = \begin{pmatrix} g_u & g_v \end{pmatrix}_i .

The calibration Eq. 2 can be differentiated to obtain

\left( \frac{\partial w_B}{\partial p_B} \right)_i = \begin{pmatrix} f_u & s \\ 0 & f_v \end{pmatrix} .

The Jacobian of the flat-plane transform is obtained by differentiating Eq. 1 at P = P_B as

\left( \frac{\partial p_B}{\partial P_B} \right)_i = \begin{pmatrix} \frac{\partial x_B}{\partial X_B} & \frac{\partial x_B}{\partial Y_B} & \frac{\partial x_B}{\partial Z_B} \\ \frac{\partial y_B}{\partial X_B} & \frac{\partial y_B}{\partial Y_B} & \frac{\partial y_B}{\partial Z_B} \end{pmatrix}_i
= \frac{1}{\left( q_2 Z_B + q_3 \|P_B\| \right)_i \|P_B\|_i} \begin{pmatrix} q_3 x_B X_B - q_1 \|P_B\| & q_3 x_B Y_B & q_3 x_B Z_B \\ q_3 y_B X_B & q_3 y_B Y_B - q_1 \|P_B\| & q_3 y_B Z_B \end{pmatrix}_i .

Since the ODVS transforms giving p_A and p_B do not change if the homogeneous coordinates P_A and P_B are changed by a scale factor, we can scale the right-hand side of Eq. 6 to give

P_B = \frac{1}{H_{33}} H P_A = \begin{pmatrix} h_1 X_A + h_2 Y_A + h_3 Z_A \\ h_4 X_A + h_5 Y_A + h_6 Z_A \\ h_7 X_A + h_8 Y_A + Z_A \end{pmatrix} .

Taking the Jacobian w.r.t. h = (h_1 \ldots h_8) gives

\left( \frac{\partial P_B}{\partial h} \right)_i = \begin{pmatrix} X_A & Y_A & Z_A & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & X_A & Y_A & Z_A & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & X_A & Y_A \end{pmatrix}_i .

4.3 Outlier removal

The estimate given above is optimal only when all points really belong to the planar surface and the underlying noise distributions are Gaussian. However, the estimation is highly sensitive to the presence of outliers, i.e., points not satisfying the ground motion model. These features should be separated using a robust method. To reduce the number of outliers, the road region of interest is determined using calibration information, and the processing is done only in that region to avoid extraneous features. To detect outliers, an approach similar to the data snooping approach discussed in [10] has been adapted for Bayesian estimation. In this approach, the error residual of each feature is compared with the expected residual covariance at every iteration, and the features are reclassified as inliers or outliers.


Fig. 6. Hierarchical motion estimation algorithm

If a point z_i is not included in the estimation of h, i.e., is currently classified as an outlier, then the covariance of its residual is

\mathrm{Cov}\left[ \Delta z_i - C_i \Delta h \right] \simeq R + C_i P C_i^T .

However, if z_i is included in the estimation of h, i.e., is currently classified as an inlier, then it can be shown that the covariance of its residual is given by

\mathrm{Cov}\left[ \Delta z_i - C_i \Delta h \right] \simeq R - C_i P C_i^T < R .

Hence, to classify in the next iteration, the Mahalanobis norm of the residual is compared with a threshold τ. For a point currently classified as an outlier the following condition is used:

\left[ \Delta z_i - C_i \Delta h \right] \left[ R + C_i P C_i^T \right]^{-1} \left[ \Delta z_i - C_i \Delta h \right] < \tau . \qquad (12)

For a point currently classified as an inlier, the covariance R is used in practice instead of R − C_i P C_i^T in order to avoid a non-positive-definite covariance arising from approximations due to the nonlinearities:

\left[ \Delta z_i - C_i \Delta h \right] R^{-1} \left[ \Delta z_i - C_i \Delta h \right] < \tau . \qquad (13)

This somewhat increases the probability of classifying a point as an outlier instead of an inlier, which errs on the safe side.

Note that this method is effective only when there is some prior knowledge about the motion parameters; otherwise the prior covariance P_− would become infinite. If there is no prior knowledge, robust estimators can be used as in [32]. The motion parameter estimation algorithm is shown in Fig. 6.
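A sketch of this reclassification, for the scalar per-point case with diagonal R, is shown below; the variable names and the threshold value are assumptions:

import numpy as np

def reclassify(dz, C, dh, P, R_diag, inlier, tau=9.0):
    """Return the new inlier mask from the residual tests of Eqs. 12-13.

    dz, R_diag : per-point residuals and temporal-gradient variances (N,)
    C          : per-point Jacobian rows (N x 8);  dh: current correction (8,)
    P          : posterior covariance of the motion parameters (8 x 8)
    inlier     : boolean mask with the current classification
    """
    r = dz - C @ dh                                        # residual after correction
    var_out = R_diag + np.einsum('ij,jk,ik->i', C, P, C)   # R + C_i P C_i^T (outliers)
    d2 = np.where(inlier, r**2 / R_diag,                   # Eq. 13 for current inliers
                          r**2 / var_out)                  # Eq. 12 for current outliers
    return d2 < tau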

5 Dynamic event detection and tracking

After motion compensation, the features on the ground plane would be aligned between the two frames, whereas those due to stationary and moving objects would be misaligned. The image difference between the frames would therefore enhance the objects and suppress the road features. However, the image difference depends on the residual motion as well as on the spatial gradients at that point. In highly textured regions, the image difference would be large even for small residual motion, and in less textured regions the image difference would be small even for large residual motion. To compensate this effect, the normalized frame difference [40] is used. It is given at each pixel by

\frac{ \sum g_t \sqrt{g_u^2 + g_v^2} }{ k + \sum \left( g_u^2 + g_v^2 \right) } ,

where g_u, g_v are the spatial gradients and g_t is the temporal gradient. The constant k is used to suppress the effect of noise in highly uniform regions. The summation is performed over a K × K neighborhood of each pixel. In fact, the normalized difference is a smoothed version of the normal optical flow and hence depends on the amount of motion near the point. Blobs corresponding to object features are obtained using morphological operations.
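A sketch of the normalized frame difference is given below; taking the absolute value of g_t inside the sum is an assumption of this sketch, made so that oppositely signed contributions do not cancel within the window:

import numpy as np
from scipy.ndimage import uniform_filter

def normalized_frame_difference(gu, gv, gt, K=5, k=1e3):
    """gu, gv: spatial gradients; gt: temporal gradient (2D arrays of equal shape)."""
    def window_sum(img):                        # local K x K sum at every pixel
        return uniform_filter(img, size=K) * (K * K)
    grad2 = gu**2 + gv**2
    return window_sum(np.abs(gt) * np.sqrt(grad2)) / (k + window_sum(grad2))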

Nearby blobs are clustered into one, and the cluster centroids are tracked from frame to frame by an algorithm similar to [17]. For tracking, a list containing the frame number, unique ID, position, and velocity of each track is maintained. The list is empty in the beginning. The following steps, sketched in code after the list, are performed to associate the tracks with features:

• At each frame, associate each existing track with the nearest feature in a neighborhood window around the track position. Use a Kalman filter [2] to update the track with the feature. If no feature is found in the neighborhood window, only a time update is performed.
• For features not having tracks in their neighborhood, create a new track out of the feature and update it in the next frame.
• To keep the number of tracks within bounds, delete the weakest tracks when the number of tracks gets too large.
• Merge tracks that are very close to each other, have nearly the same velocity, and are therefore assumed to be from the same object.
• Display tracks that have survived for a stipulated number of frames along with the track history.
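A minimal sketch of this association loop is given below; the Track class, the gating distance, and the crude constant-velocity correction are simplifications assumed for illustration (the paper updates each track with a Kalman filter and additionally prunes and merges tracks):

from dataclasses import dataclass, field
import itertools
import numpy as np

_next_id = itertools.count()

@dataclass
class Track:
    pos: np.ndarray                       # current position of the cluster centroid
    vel: np.ndarray = field(default_factory=lambda: np.zeros(2))
    tid: int = field(default_factory=lambda: next(_next_id))
    age: int = 1

def associate(tracks, features, gate=30.0):
    """features: blob centroids (2-vectors) detected in the current frame."""
    unused = [np.asarray(f, float) for f in features]
    for t in tracks:
        predicted = t.pos + t.vel                        # time update
        t.pos = predicted
        if unused:
            d = [np.linalg.norm(f - predicted) for f in unused]
            j = int(np.argmin(d))
            if d[j] < gate:                              # nearest feature inside the gate
                f = unused.pop(j)
                t.vel = t.vel + 0.5 * (f - predicted)    # crude velocity correction
                t.pos = f
                t.age += 1
    tracks.extend(Track(pos=f) for f in unused)          # unmatched features start new tracks
    return tracks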

For each track that survives over a minimum number of frames, the original ODVS image is used to generate a perspective view [22] of the event around the center of the bounding box.

6 Experimental validation and results

A series of experimental trials was conducted to systematically evaluate the capabilities and performance of the mobile sentry system for event detection and tracking. Three different types of ODVS mountings were utilized to examine the generality and functionality of the mobile sentry system. The first trial involved a camera on an electric vehicle, the second used a mobile robot, and the third involved a walking person with a helmet-mounted ODVS.


Fig. 7. An ODVS camera mounted on an electric cart for a mobile sentry experimental run

The first experiment was done most systematically in order to evaluate the performance. The other experiments are currently in the exploratory stage, and more work is required to characterize their performance. A related application applying a similar approach to an automobile-mounted ODVS is also shown.

6.1 Camera on electric cart

The first experimental trial of the mobile sentry utilized the ODVS camera mounted on an electric cart, as shown in Fig. 7. The cart was driven on a campus road at speeds between 2 and 7 miles/h (approx. 1 to 3 m/s). The approximate speed of the cart was determined using GPS and used as an a priori motion estimate. It was also observed that the ellipse corresponding to the entire FOV of the ODVS was oscillating, due possibly to relative vibrations between the camera and the mirror or to the automatic motion stabilization in the camera. These oscillations were suppressed by estimating the center of the FOV ellipse using a Hough transform and translating it to a fixed position. The effect of the remaining vibrations was suppressed using the parametric motion estimation process.

Figure 8a shows an image from the ODVS video sequence. The estimated parametric motion is shown using red arrows. Figure 8b shows the classification of points into inliers (gray), outliers (white), and unused (black) points. It should be noted that the outliers are usually identified when the edges are perpendicular to the motion. When an edge is parallel to the motion, the aperture problem makes it difficult to identify it. The estimation is done using the inlier points only. An image with the normalized frame difference between the motion-compensated frames is shown in Fig. 8c. It is seen that the independently moving car and person stand out, whereas the stationary features on the ground are attenuated in spite of egomotion. Figure 8d shows the bounding boxes around the moving car and person after postprocessing using morphological operations.

Since the algorithm uses a planar motion model, stationary objects above the ground induce motion parallax and are detected if they are sufficiently close to the camera and included in the region of interest.

Fig. 8. Detection of moving objects from an ODVS mounted on an electric cart. a Estimated parametric motion of ground plane. Parts of the image corresponding to the cart as well as distant objects are excluded from the motion estimation process. b Features used for estimation. Gray features are inliers, and white features are outliers. c Motion-compensated difference image. d Postprocessed image showing detection of moving car and person. The angle made with the x-axis in degrees is also shown


Table 2. Performance evaluation. The right column shows the ground-truth number of relevant events in the image sequence. The other columns show the number and percentage of events detected by the system. The last three rows show the number of false alarms due to stationary objects and shadows; ground truth is not relevant here

Minimum track length  | 15 frames | 10 frames | Ground truth
Total events          | 14 (74%)  | 17 (89%)  | 19
– Persons             | 9 (90%)   | 9 (90%)   | 10
– Vehicles            | 5 (55%)   | 8 (89%)   | 9
Total false alarms    | 3         | 4         | N.A.
– Stationary objects  | 1         | 2         | N.A.
– Shadows             | 2         | 2         | N.A.

Figure 9 shows the detection of a stationary structure as well as a moving person. Only the parts within the region of interest are detected.

The centroids of the detected bounding boxes were tracked over time, and the tracks that survived over ten or more frames were identified. Typical snapshots from these tracks were taken, and the distortion due to the ODVS was corrected to get the perspective view looking toward the track position as in [22]. Figure 10 shows the snapshots from these tracks, detecting the events.

To evaluate the algorithm performance, the detection results were compared with ground truth obtained by manually observing the video sequence. The performance was compared for two different thresholds on the number of frames for which a track has to survive to be detected as an event. Table 2 shows the detection rate in terms of the total number of events (ground truth) and the number of events actually detected. Note that stationary obstacles and shadows are classified as "false alarms" since they are currently not separated from independently moving objects. A lower threshold increases the detection rate but also increases false alarms. It was observed that two events, corresponding to a moving person and a cart, were not detected at all for the following reasons. The person and cart were quite far away, and the person in particular appeared small in the image. Furthermore, the camera vehicle was turning, inducing considerable rotational egomotion. Also, the objects were near the boundary of the region of interest that was analyzed. An image of this person is shown in Fig. 11.

Attributes such as the time, duration, and position of the events were extracted. The camera position at the event time was extracted from the onboard GPS. Assuming that the point nearest to the camera lies on the ground, the event location with reference to the camera could be computed. These were added to the camera position coordinates to obtain the event position. The approximate positions of the camera as well as the actual event were mapped as shown in Fig. 12. Table 3 summarizes some of the events and their attributes.

6.2 Camera on a mobile robot

The second experimental configuration was a robotic platform designed in our laboratory. This platform is called the Mobile Interactive Avatar (MIA) [19], in which cameras and displays can be mounted on a semiautonomous robot, as shown in Fig. 13, to interact with people at a distance.

Fig. 9. Detection of a stationary structure in addition to a moving person. Note that only the part of the structure within the region of interest is detected


Fig. 10. Captured events with their IDs. a Detected persons and vehicles. b Shadows and stationary obstacles currently considered false alarms

Fig. 11. Original image corresponding to missed events. The moving person and vehicle were far from the camera and near the region-of-interest boundary. Also, the camera vehicle was making a turn, inducing considerable rotational egomotion in the image

The robot was driven around the corridor of our building with people walking around it. Figure 14 shows the detection of moving persons in one of the frames. Snapshots of detected people are shown in Fig. 15. However, it was noted that the speed of the robot was much smaller than that of the people, which would mean that simpler methods could also yield good results in this scenario.

Fig. 12. Map showing the position of the events in red and the camera position at that time in blue. The egovehicle track is marked by the yellow line. The event IDs are labeled in black

Fig. 13. Mobile Interactive Avatar: semiautonomous robotic system used for evaluating the mobile sentry

Table 3. Summarization of event attributes. The left image shows the original ODVS image at the time of the event, the middle image shows the output of the detection algorithm, and the right image shows the snapshot of the event corrected for ODVS distortion

Event ID               | 65           | 109
Event time (snapshot)  | 16:01:08.3   | 16:01:40.0
Event duration [s]     | 3.6          | 1.9
Event position [m]     | (7.1, -6.6)  | (53.1, 1.8)
Camera position [m]    | (7.8, -5.3)  | (56.0, -3.1)


Fig. 14. Detection of moving persons in an image sequence from a mobile robot. a Estimated parametric motion of the ground plane. The part of the image in the center, which images the camera itself, is not used for estimation. b Features used for estimation. Gray features are inliers, and white features are outliers. c Normalized image difference after motion compensation. The moving person is detected, but the lines on the ground are not detected. d Postprocessing and tracking output. The track of the detected person with the ID is shown by the yellow line

Also, the height of the robot was small, and people's faces could not be effectively captured.

6.3 Helmet-mounted camera

The third experimental study for the mobile sentry involved a person walking with an ODVS mounted on a helmet, as shown in Fig. 16.

Fig. 15. Some of the interesting events captured by the mobile robot

Fig. 16. ODVS camera mounted on a helmet. This configuration enables acquisition of snapshots of surrounding people, including their faces. However, there is considerable camera rotation due to movement of the head and body that should be compensated

In this configuration, the camera height was approximately 2 m, which enabled easy capture of people's faces. The speed of the person with the helmet was comparable to that of other people. There was also considerable camera rotation due to head and body movement, which helped to test the algorithm in the presence of large rotational egomotion. Figure 17 shows the detection of a moving person in one of the frames. To reduce estimation errors, the region of interest for motion analysis was truncated to remove the camera's own image as well as the objects above the horizon. It is seen that there is significant rotational motion between two frames. In spite of this motion, the moving person is separated from background features such as the lines on the ground. However, if the rotational motion is too large, the detection often deteriorates and tracks get split into parts. Some snapshots of detected people are shown in Fig. 18.

Unlike the first experiment, the events here consisted of the same persons moving around the camera. Also, there was more breaking of tracks due to the large rotational motion. Hence, instead of counting the number of events detected, the orientation of the tracks in each frame was plotted against time in Fig. 19. The identities of the persons were manually recorded and are color-coded in the figure. The crosses show the track breaks. It was observed that in a sequence of 5 min (3000 frames at 10 frames per second), the persons were tracked when they were sufficiently close to the camera and the rotational motion was not very large.

6.4 Vehicle-mounted ODVS

In a related application [15,23], the event detection approach was applied to an omnicamera mounted on an automobile. The actual vehicle speed, obtained from a CAN bus, was used for the initial motion estimate.


Fig. 17. Detection of a moving person in an image sequence from a helmet-mounted camera. a Estimated parametric motion of the ground plane. There is significant rotational motion that is estimated by the algorithm. b Features used for estimation. Gray features are inliers, and white features are outliers. The parts of the image in the center, above the horizon, and those having small image gradients are not used in estimation. c Normalized image difference after motion compensation. The moving person is detected, but the lines on the ground are not detected. d Postprocessing and tracking output. The track of the detected person with the ID is shown by the yellow line. The position coordinates are computed by assuming that the point on the blob nearest the camera is on the ground

Fig. 18. Some detected events from the helmet-mounted camera

Fig. 19. Time series showing the orientation theta (in radians) of each person with respect to the camera, plotted against time (in seconds). Each color corresponds to a different person. Track breaks are marked by a ×

A video sequence of 36,000 frames (20 min) was processed, and vehicles on both sides of the car were detected by an algorithm similar to the one described above, as shown in Fig. 20a. The distortion due to omni imaging was removed to generate the bird's-eye view, as shown in Fig. 20b. Figure 20c shows the plots of track positions against time for a segment of the video.

7 Summary and discussion

This paper described an approach for event detection using egomotion compensation from mobile omnidirectional (ODVS) cameras. It applied the concept of direct motion estimation using image gradients to ODVS cameras. The motion of the ground was modeled as planar motion, and the features not obeying the motion model were separated as outliers. An iterative estimation framework was used for optimally fusing the motion information in the image gradients with a priori known information about the camera motion and calibration. Coarse-to-fine motion estimation was used, and the motion between the frames was compensated at each iteration. A scheme based on data snooping was used to remove outliers. Experiments were performed by obtaining image sequences from various types of mobile platforms and detecting events such as moving persons and automobiles, giving satisfactory results.

For future work, we plan to improve the robustness of the system, especially for correct localization of large objects.


Fig. 20. Detection of moving vehicles in an image sequence using an omnidirectional camera mounted on a moving car. a The track history of the vehicle over a number of frames is marked; the track ID and the coordinates in the road plane are also shown. b Bird's-eye view generated by removing the omni distortion, showing detected vehicles and their coordinates. c Plot of the longitudinal position [m] of vehicle tracks against time [s]. The tracks are color coded as red, yellow, and green according to increasing lateral distance from the ego vehicle

The algorithm currently detects regions containing edges, where motion information is significant, but does not respond to uniform areas of large objects. Morphological operations were helpful in combining the detected regions, but a systematic approach based on region-based segmentation and clustering may be more appropriate for getting accurate localization in terms of bounding boxes. It was also observed that the tracks often got broken due to inaccurate localization of the detected blobs. We plan to track entire blobs instead of the centroids to obtain more robust tracks. The events can then be classified into categories such as persons and vehicles using criteria such as size and shape.

Learning-based approaches such as [34] would also be useful for classification.

The method described above is appropriate for scenes where the background is predominantly planar and the foreground consists of outliers in the form of small objects. If the scene is not that simple, motion segmentation should be performed along with estimation. In the case of scenes with multiple stationary planar surfaces, the surfaces have the same parameters for rotation and translation but different plane normals [14]. Hence, the egomotion can be parameterized directly in terms of the linear and angular velocity of the camera and the plane normals of each planar surface. An iterative estimation procedure that estimates each planar surface separately but uses the estimates it obtains for rotation and translation as starting points for estimating other planar surfaces could make the process more robust to outliers. For example, if the scene consists of features far away from the camera, their egomotion could be considered almost pure rotation, having only three degrees of freedom. These features could be used to estimate the approximate rotation [40]. This rotation could be used as an initial estimate for the parts of the scene containing a ground plane in order to determine the full planar motion. This procedure could be combined with a robust motion segmentation method such as [32] to automatically separate multiple planar surfaces.

Alternatively, the motion parameters can be estimated using a bootstrap method from small patches, combining the patches whose motion is consistent with the ground plane, as done by Ke and Kanade [28]. For 3D scenes with large variations in depth, a structure-from-motion approach using the epipolar constraint [7] is more appropriate. The plane+parallax method proposed by Irani and Anandan [24] can also be used for a wide variety of scenes, including planar, piecewise planar, and 3D.

To discriminate between independently moving objects and stationary objects above the ground, the rigidity constraint [24] could be used in the plane+parallax framework. We plan to generalize the piecewise planar motion segmentation as well as plane+parallax methods for use with ODVS cameras, using nonlinear motion models for complex scenes and independent motion discrimination.

Acknowledgements. We are thankful for the grant awarded by the Technical Support Working Group (TSWG) of the US Department of Defense, which provided the primary sponsorship of the reported research. We also thank our colleagues from the UCSD Computer Vision and Robotics Research Laboratory for their contributions and support.

References

1. Achler O, Trivedi MM (2002) Real-time traffic flow analysis using omnidirectional video network and flatplane transformation. In: Workshop on Intelligent Transportation Systems, Chicago, IL, 2002
2. Bar-Shalom Y, Li XR, Kirubarajan T (2001) Estimation with applications to tracking and navigation. Wiley, New York
3. Benosman R, Kang SB (2001) Panoramic vision: sensors, theory, and applications. Springer, Berlin Heidelberg New York
4. Black MJ, Anandan P (1996) The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Comput Vision Image Understand 63(1):75–104
5. Boult T, Erkin A, Lewis P, Michaels R, Power C, Qian C, Yin W (1998) Frame-rate multi-body tracking for surveillance. In: Proc. DARPA Image Understanding workshop
6. Carlsson S, Eklundh JO (1990) Object detection using model-based prediction and motion parallax. In: European conference on computer vision, April 1990, pp 297–306
7. Chang P, Herbert M (2000) Omni-directional structure from motion. In: IEEE workshop on omnidirectional vision, Hilton Head Island, SC, June 2000. IEEE Press, pp 127–133
8. Cielniak G, Miladinovic M, Hammarin D, Goranson L, Lilienthal A, Duckett T (2003) Appearance-based tracking of persons with an omnidirectional vision sensor. In: Proc. IEEE workshop on omnidirectional vision
9. Daniilidis K, Makadia A, Bulow T (2002) Image processing in catadioptric planes: spatiotemporal derivatives and optical flow computation. In: Proc. IEEE workshop on omnidirectional vision, June 2002, pp 3–12
10. Danuser G, Stricker M (1998) Parametric model fitting: from inlier characterization to outlier detection. IEEE Trans Pattern Anal Mach Intell 20(2):263–280
11. Dean KL (2003) Smartcams take aim at terrorists. Wired News, June 2003. http://cvrr.ucsd.edu/press/articles/Wired News Smartcams.html
12. Faugeras O (1993) Three-dimensional computer vision: a geometric viewpoint. MIT Press, Cambridge, MA
13. Gandhi T, Devadiga S, Kasturi R, Camps O (2000) Detection of obstacles using ego-motion compensation and tracking of significant features. Image Vision Comput 18(10):805–815
14. Gandhi T, Kasturi R (2000) Application of planar motion segmentation for scene text extraction. In: Proc. international conference on pattern recognition, 1:445–449
15. Gandhi T, Trivedi MM (2004) Motion based vehicle surround analysis using omni-directional camera. In: Proc. IEEE symposium on intelligent vehicles (in press)
16. Gandhi T, Trivedi MM (2003) Motion analysis of omni-directional video streams for a mobile sentry. In: 1st ACM international workshop on video surveillance, Berkeley, CA, November 2003, pp 49–58
17. Gandhi T, Yang MT, Kasturi R, Camps O, Coraor L, McCandless J (2003) Detection of obstacles in the flight path of an aircraft. IEEE Trans Aerospace Electron Syst 39(1):176–191
18. Gluckman J, Nayar S (1998) Ego-motion and omnidirectional cameras. In: Proc. international conference on computer vision, pp 999–1005
19. Hall TB, Trivedi MM (2002) A novel interactivity environment for integrated intelligent transportation and telematic systems. In: 5th international IEEE conference on intelligent transportation systems, Singapore, September 2002
20. Haritaoglu I, Harwood D, Davis LS (2000) W4: Real-time surveillance of people and their activities. IEEE Trans Pattern Anal Mach Intell 22(8):809–830
21. Horn B, Schunck B (1981) Determining optical flow. In: DARPA Image Understanding workshop, pp 144–156
22. Huang KC, Trivedi MM (2003) Video arrays for real-time tracking of persons, head and face in an intelligent room. Mach Vision Appl 14(2):103–111
23. Huang K, Trivedi MM, Gandhi T (2003) Driver's view and vehicle surround estimation using omnidirectional video stream. In: IEEE symposium on intelligent vehicles, Columbus, OH, June 2003
24. Irani M, Anandan P (1998) A unified approach to moving object detection in 2D and 3D scenes. IEEE Trans Pattern Anal Mach Intell 20(6):577–589
25. Jahne B, Haußecker H, Geißler P (1999) Handbook of computer vision and applications, vol 2. Academic, San Diego, pp 397–422
26. Jogan M, Leonardis A (2000) Robust localization using panoramic view-based recognition. In: Proc. international conference on pattern recognition, pp 136–139
27. Kanatani K (1993) Geometric computation for machine vision. Oxford University Press, Oxford
28. Ke Q, Kanade T (2003) Transforming camera geometry to a virtual downward-looking camera: robust ego-motion estimation and ground-layer detection. In: Proc. IEEE conference on computer vision and pattern recognition, June 2003, pp I:390–397
29. Kruger W (1999) Robust real time ground plane motion compensation from a moving vehicle. Mach Vision Appl 11:203–212
30. Lawson S (2002) Yes, you are being watched. PC World, 27 December 2002. http://www.pcworld.com/news/article/0,aid,108121,00.asp
31. Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: International joint conference on artificial intelligence, pp 674–679
32. Odobez JM, Bouthemy P (1998) Direct incremental model-based image motion segmentation for video analysis. Signal Process 66:143–145
33. Ramsey D (2003) Researchers work with public agencies to enhance super bowl security, 15 February 2003. http://www.calit2.net/news/2003/2-4 superbowl.html
34. Saptharishi M, Hampshire JB, Khosla PK (2000) Agent-based moving object correspondence using differential discriminative diagnosis. In: Proc. IEEE conference on computer vision and pattern recognition, June 2000, 2:652–658
35. Shakernia O, Vidal R, Sastry S (2003) Omnidirectional egomotion estimation from back-projection flow. In: Proc. IEEE workshop on omnidirectional vision, June 2003
36. Shi J, Tomasi C (1994) Good features to track. In: Proc. IEEE conference on computer vision and pattern recognition, pp 593–600
37. Simoncelli EP (1993) Coarse-to-fine estimation of visual motion. In: Proc. 8th workshop on image and multidimensional signal processing, Cannes, France, pp 128–129
38. Sogo T, Ishiguro H, Trivedi M (2001) N-ocular stereo for real-time human tracking. In: Benosman R, Kang SB (eds) Panoramic vision: sensors, theory, and applications. Springer, Berlin Heidelberg New York, pp 309–396
39. Stauffer C, Grimson WEL (1999) Adaptive background mixture model for real-time tracking. In: Proc. IEEE international conference on computer vision and pattern recognition, pp 246–252
40. Trucco E, Verri A (1998) Computer vision and applications: a guide for students and practitioners. Prentice-Hall, Englewood Cliffs, NJ
41. Vassallo RF, Santos-Victor J, Schneebeli HJ (2002) A general approach for egomotion estimation with omnidirectional images. In: Proc. IEEE workshop on omnidirectional vision, June 2002, pp 97–103
42. Winters N, Gaspar J, Lacey G, Santos-Victor J (2002) Omni-directional vision for robot navigation. In: Proc. IEEE workshop on omnidirectional vision, pp 21–28
43. Zhu Z, Karuppiah D, Riseman E, Hanson A (2003) Omni-directional vision for robot navigation. Robot Automat Mag (in press)