
Advances in Cooperative Multi-Sensor Video Surveillance*

Takeo Kanade, Robert T. Collins and Alan J. Lipton
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA
e-mail: {kanade, rcollins, ...}@cs.cmu.edu
URL: http://www.cs.cmu.edu/~vsam

Peter Burt and Lambert Wixson
The Sarnoff Corporation, Princeton, NJ
e-mail: {pburt, lwixson}@sarnoff.com

Abstract

Carnegie Mellon University (CMU) and the Sarnoff Corporation (Sarnoff) are performing an integrated feasibility demonstration of Video Surveillance and Monitoring (VSAM). The objective is to develop a cooperative, multi-sensor video surveillance system that provides continuous coverage over battlefield areas. Significant achievements have been demonstrated during VSAM Demo I in November 1997, and in the intervening year leading up to Demo II in October 1998.

1 Introduction

The thrust of the VSAM Integrated Feasibility Demonstration (IFD) research program is cooperative multi-sensor surveillance to support enhanced battlefield awareness [Kanade et al., 1997]. We are developing automated video understanding technology that will enable a single human operator to monitor activities over a complex area using a distributed network of active video sensors. Our advanced video understanding technology can automatically detect and track multiple people and vehicles within cluttered scenes, and monitor their activities over long periods of time. Human and vehicle targets are tracked seamlessly through the environment by a network of active sensors that cooperate to cover areas that cannot be viewed continuously by any single sensor alone. The idea is to allow a commander or military analyst to tap into a network of sensors deployed on and over the battlefield to get a broad overview of the current situation. Other military and law enforcement applications include providing perimeter security for troops, monitoring peace treaties or refugee movements from unmanned air vehicles, providing security for embassies or airports, and staking out suspected drug or terrorist hide-outs by collecting time-stamped pictures of everyone entering and exiting the building.

* Funded by DARPA contract DAAB07-97-C-J031.

Keeping track of people, vehicles, and their interactions, over a chaotic area such as the battlefield, is a difficult task. Populating the battlefield with digital sensing units can provide the commander with up-to-date sensory feedback, leading to improved situational awareness and better decision making. The role of VSAM video understanding technology in achieving this goal is to automatically "parse" people and vehicles from raw video, determine their geolocations, and automatically insert them into a dynamic scene visualization. We have developed robust routines for detecting moving objects using a combination of temporal differencing and template tracking [Lipton et al., 1998]

(in this proceedings). Detected objects are classified into semantic categories such as human, human group, car, and truck using shape and color analysis, and these labels are used to improve tracking using temporal consistency constraints. Further classification of human activity, such as walking and running, has also been achieved [Fujiyoshi and Lipton, 1998] (in this proceedings). Geolocations of labeled entities are determined from their image coordinates using either wide-baseline stereo from two or more overlapping camera views, or intersection of viewing rays with a terrain model from monocular views [Collins et al., 1998] (in this proceedings). An airborne surveillance platform has been incorporated into the system, and novel real-time algorithms for camera fixation, sensor multi-tasking, and moving target detection have been developed for this moving platform [Wixson et al., 1998]. Resulting target hypothesis information from all sensor processing units (SPUs), including target type and trajectory, is transmitted as symbolic data packets back to a central operator control unit (OCU), where it is displayed on a graphical user interface to give a broad overview of scene activities.

This paper provides an overview of VSAM research at Carnegie Mellon University and the Sarnoff Corporation, drawing upon results achieved during VSAM Demo I held on November 12, 1997 at Bushy Run, and preliminary results from the new VSAM IFD testbed system that will be unveiled during Demo II at CMU on October 8, 1998. Section 2 describes the components and infrastructure underlying the current VSAM testbed system, while Section 3 details the video understanding technologies that provide core system functionality. Section 4 outlines a key idea of our research, namely the use of geospatial site models to enhance system performance. The role of the human operator in achieving battlefield awareness also cannot be overstated: our intention is not to design a fully automated, stand-alone system, but rather to provide a human commander with timely information regarding events unfolding on the battlefield. To this end, a graphical user interface for visualizing large-scale, multi-agent events is a vital system component, and the relevant human-computer interface issues are discussed in Section 5. Finally, Section 6 provides a road map of where we have been and where we are going by outlining the significant technological accomplishments achieved during VSAM Demo I, and those planned for Demos II and III.

2 VSAM Testbed System

The current VSAM testbed system is evolving along the path outlined in the 1997 VSAM PI report [Kanade et al., 1997]. The system consists of a central operator control unit (OCU) which receives video and ethernet data from remote sensor processing units (SPUs) (Figure 1). The OCU is responsible for integrating symbolic and video information accumulated by each of the SPUs and presenting it in a concise, meaningful form to users operating VSAM visualization tools. A central graphical user interface (GUI) is used to task the system by specifying which areas, targets and events are worthy of special attention.

2.1 Sensor Processing Units (SPUs)

Sensor processing units (SPUs) are the front-end nodes of the VSAM network. Their function is to analyze video imagery for the presence of significant entities or events and transmit that information to the OCU. The notion is that SPUs act as intelligent filters between sensing devices and the VSAM network. This arrangement allows many different sensor modalities to be seamlessly integrated into the system. Furthermore, performing as much video processing as possible on board the SPU reduces the bandwidth requirements of the VSAM network: full video signals need not be transmitted, only the symbolic data extracted from them.

The VSAM testbed can handle a wide variety of sensors and sensor platforms. SPUs can vary from simple sensors with rudimentary vision processing capabilities to very sophisticated sensing systems capable of making intelligent inferences about activities in their field of regard. SPUs can be connected to the system by radio links, serial lines, wireless ethernet, cables, or any other medium. The 1997 demo saw the integration of monochrome CCD sensors with an airborne color CCD sensor. In 1998, the list of SPUs includes:

- Five fixed-mount color CCD sensors with variable pan, tilt and zoom control, affixed to buildings around the CMU campus.

Figure 1: Overview of the VSAM testbed system: fixed-mount, relocatable, and airborne SPUs report to the operator control unit, which drives the graphical user interface and visualization tools.

- One van-mounted relocatable SPU that can be moved from one point to another during a surveillance mission.

- A FLIR Systems camera turret mounted on an aircraft.

- A Columbia-Lehigh CycloVision ParaCamera with a hemispherical field of view.

- A Texas Instruments (TI) indoor surveillance system, which after some modifications is capable of directly interfacing with the VSAM network.

Logically, all of these SPUs are treated identically. They differ only in the type of physical connection required to the OCU. In future years, it is hoped that other VSAM sensor modalities will be added, including thermal infra-red sensors, multi-camera omnidirectional sensors, and stereo sensors.

2.2 Airborne SPU

The airborne SPU warrants further discussion. The sensor and computation packages are mounted on a Britten-Norman Islander twin-engine aircraft operated by the U.S. Army Night Vision and Electronic Sensors Directorate. The Islander, shown in Figure 2, is equipped with a FLIR Systems Ultra-3000 turret that has two degrees of freedom (pan/tilt), a GPS system for measuring position, and an AHRS (Attitude Heading Reference System) device for measuring orientation. Video processing is performed using on-board Sarnoff VFE-100 and PVT-200 video processing engines [Hansen et al., 1994] in Years 1 and 2, respectively.

Figure 2: The Night Vision and Electronic Sensors Directorate Islander aircraft. The camera turret is below the wing at left.

Figure 3: The Sarnoff airborne camera simulator.

Because the cost of testing algorithms on the airplane is high, we have constructed a simulated airborne platform using a gantry, shown in Figure 3. The gantry travels in the XY plane with a camera suspended beneath it on a pan/tilt mount. This enables us to obtain quantitative results and simulate airborne performance while debugging the airborne VSAM algorithms.

2.3 Operator Control Unit (OCU)

Figure 4 shows the functional architecture of the VSAM OCU. It accepts video processing results from each of the SPUs and integrates the information with a site model and a database of known targets to infer activities that are of interest to the user. Data about relevant entities and events is then packaged up and transmitted to the GUI and other visualization tools as output from the system.

Figure 4: Functional architecture of the VSAM OCU.

One key piece of system functionality provided by the OCU is sensor arbitration. At any given time, the system has a number of "tasks" that may need attention. These tasks are explicitly indicated by the user through the GUI, and may include such things as specific targets to be tracked, specific regions to be watched, or specific events to be detected (such as a person loitering near a particular doorway). Sensor arbitration is performed by a cost function that scores the assignment of each SPU to each task, based on the priority of the task, the load on the SPU, the visibility of the task, and so on. The system performs a greedy optimization of this cost to determine the combination of SPU taskings that best satisfies overall system performance requirements.
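As an illustration, the following is a minimal Python sketch of such greedy cost-based arbitration. The particular cost terms, weights, and data layout are assumptions made here for exposition; they are not the OCU's actual implementation.

    # Hypothetical sketch of greedy SPU-to-task arbitration (not the actual OCU
    # code): assign each task to the SPU with the lowest cost, where the cost
    # combines task priority, current SPU load, and task visibility.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        priority: float                               # higher = more important

    @dataclass
    class SPU:
        name: str
        load: float = 0.0                             # tasks already assigned
        visible: set = field(default_factory=set)     # task names this SPU can see

    def cost(spu: SPU, task: Task) -> float:
        """Illustrative arbitration cost; the weighting is an assumption."""
        if task.name not in spu.visible:
            return float("inf")                       # SPU cannot service this task
        return spu.load + 1.0 / task.priority

    def arbitrate(spus, tasks):
        """Greedy optimization: repeatedly pick the cheapest (SPU, task) pair."""
        assignments, pending = {}, list(tasks)
        while pending:
            c, spu, task = min(((cost(s, t), s, t) for s in spus for t in pending),
                               key=lambda x: x[0])
            if c == float("inf"):
                break                                 # remaining tasks unserviceable
            assignments[task.name] = spu.name
            spu.load += 1.0
            pending.remove(task)
        return assignments

    if __name__ == "__main__":
        spus = [SPU("spu-1", visible={"track-car", "watch-door"}),
                SPU("spu-2", visible={"watch-door"})]
        tasks = [Task("track-car", priority=2.0), Task("watch-door", priority=1.0)]
        print(arbitrate(spus, tasks))     # {'track-car': 'spu-1', 'watch-door': 'spu-2'}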

2.4 Graphical User Interface (GUI)

Figure 5: VSAM Demo I graphical user interface, showing the site model with sensor fields of view and target locations, target point coordinates, the airborne sensor footprint and video, and ground video feeds.

One of the technical goals of the VSAM project is to demonstrate that a single human operator can effectively monitor a significant area of interest. Towards this end, the testbed employs a graphical user interface for scene visualization and sensor suite tasking. Through this interface, the operator can task individual sensor units, as well as the entire testbed sensor suite, to perform surveillance operations such as generating a quick summary of all target activities in the area. The operator may choose to see a map of the area, with all target and sensor platform locations overlaid on it (a sample of the Demo I GUI can be seen in Figure 5).


2.5 Communication

Figure 6: A nominal architecture for expandable VSAM networks, in which multiple OCUs each control several SPUs and feed a GUI and a set of visualization (VIS) nodes.

Figure 6 depicts a nominal architecture for the VSAM network. It allows multiple OCUs to be linked together, each controlling multiple SPUs. Each OCU supports exactly one GUI through which all user-related command and control information is passed. However, data dissemination is not limited to a single user interface, but is also accessible through a series of visualization nodes (VIS).

There are two independent communication protocols and packet structures supported in this architecture: the Carnegie Mellon University Packet Architecture (CMUPA) and the Distributed Interactive Simulation (DIS) protocols. The CMUPA is designed to be a low-bandwidth, highly flexible architecture in which relevant VSAM information can be compactly packaged without redundant overhead. The concept of the CMUPA packet architecture is a hierarchical decomposition. There are six data sections that can be encoded into a packet: command; sensor; image; target; event; and region of interest (ROI). A short packet header section describes which of these six sections are present in the packet. Within each section it is possible to represent multiple instances of that type of data, with each instance potentially containing a different layout of information. At each level, short bitmasks describe the contents of the various blocks within the packets, keeping wasted space to a minimum. All communication between SPUs, OCUs and GUIs is CMUPA compatible. The CMUPA protocol specification document is accessible from http://www.cs.cmu.edu/~vsam.
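The following toy Python sketch illustrates the header-bitmask idea; it is not the CMUPA wire format itself (which is defined in the specification document cited above), and the per-section length-prefixed layout shown is an assumption.

    # Toy illustration of a bitmask-based packet header (NOT the actual CMUPA
    # format): one byte records which of the six sections are present, and
    # absent sections consume no space in the packet body.
    import struct

    SECTIONS = ["command", "sensor", "image", "target", "event", "roi"]

    def encode(packet: dict) -> bytes:
        mask, body = 0, b""
        for bit, name in enumerate(SECTIONS):
            if name in packet:
                mask |= 1 << bit
                payload = packet[name]                       # bytes payload (assumed)
                body += struct.pack("!H", len(payload)) + payload
        return struct.pack("!B", mask) + body

    def decode(data: bytes) -> dict:
        (mask,) = struct.unpack_from("!B", data, 0)
        offset, packet = 1, {}
        for bit, name in enumerate(SECTIONS):
            if mask & (1 << bit):
                (length,) = struct.unpack_from("!H", data, offset)
                offset += 2
                packet[name] = data[offset:offset + length]
                offset += length
        return packet

    if __name__ == "__main__":
        msg = encode({"target": b"type=human;x=12.3;y=45.6", "event": b"loiter"})
        print(len(msg), decode(msg))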

VIS nodes are designed to distribute the output of the VSAM network to where it is needed. They provide symbolic representations of detected activities overlaid on maps or imagery. Information flow to VIS nodes is unidirectional, originating from an OCU. All of this communication uses the DIS protocol, which is described in detail in [IST, 1994]. An important benefit to keeping VIS nodes DIS compatible is that it allows us to easily interface with synthetic environment visualization tools such as ModSAF and ModStealth (Section 5).

3 Video Understanding Technologies

3.1 Moving Target Detection

The initial stage of the surveillance problem is the extraction of moving targets from a video stream. There are three conventional approaches to moving target detection: temporal differencing (two-frame or three-frame) [Anderson et al., 1985]; background subtraction [Haritaoglu et al., 1998, Wren et al., 1997]; and optical flow (see [Barron et al., 1994] for an excellent discussion). Temporal differencing is very adaptive to dynamic environments, but generally does a poor job of extracting all relevant feature pixels. Background subtraction provides the most complete feature data, but is extremely sensitive to dynamic scene changes due to lighting and extraneous events. Optical flow can be used to detect independently moving targets in the presence of camera motion; however, most optical flow computation methods are very complex and are inapplicable to real-time algorithms without specialized hardware.

The approach presented here is similar to that taken in [Haritaoglu et al., 1998] and is an attempt to make background subtraction more robust to environmental dynamism. The notion is to use an adaptive background model to accommodate changes to the background while maintaining the ability to detect independently moving targets.

The first of these issues is dealt with by using a statistical model of the background that provides a mechanism for adapting to slow changes in the environment. For each pixel value p_n in the nth frame, a running average p̄_n and a form of standard deviation σ_n are maintained by temporal filtering. Due to the filtering process, these statistics change over time, reflecting dynamism in the environment.

The filter is of the form

    F(t) = e^{t/\tau}                                                        (1)

where τ is a time constant that can be configured to refine the behavior of the system. The filter is implemented as

    \bar{p}_{n+1} = \alpha p_{n+1} + (1 - \alpha) \bar{p}_n
    \sigma_{n+1} = \alpha |p_{n+1} - \bar{p}_{n+1}| + (1 - \alpha) \sigma_n        (2)

where the gain α is determined by the time constant τ and the frame rate f. Unlike the models of both [Haritaoglu et al., 1998] and [Wren et al., 1997], this statistical model incorporates noise measurements to determine foreground pixels, rather than a simple threshold. This idea is inspired by [Grimson and Viola, 1997].

If a pixel has a value that is more than 2σ from p̄_n, then it is considered a foreground pixel. At this point a multiple hypothesis approach is used for determining its behavior. A new set of statistics (p̄′, σ′) is initialized for this pixel and the original set is remembered. If, after time t = 3τ, the pixel value has not returned to its original statistical value, the new statistics are chosen as replacements for the old.
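A minimal NumPy sketch of this adaptive background model is given below. The value of alpha, the initial sigma, and the omission of the multiple-hypothesis replacement step are simplifications assumed here for illustration.

    # Sketch of the adaptive background model of equation (2): per-pixel running
    # mean and deviation maintained by temporal filtering, with foreground
    # declared where a pixel departs from the mean by more than 2*sigma. The
    # multiple-hypothesis replacement after t = 3*tau is omitted for brevity.
    import numpy as np

    class AdaptiveBackground:
        def __init__(self, first_frame, alpha=0.05, init_sigma=10.0):
            self.mean = first_frame.astype(np.float32)
            self.sigma = np.full_like(self.mean, init_sigma)
            self.alpha = alpha

        def apply(self, frame):
            frame = frame.astype(np.float32)
            # Foreground test: more than 2 sigma from the running average.
            foreground = np.abs(frame - self.mean) > 2.0 * self.sigma
            # Temporal filtering of the statistics; updating only background
            # pixels is a simplification of the paper's hypothesis scheme.
            bg, a = ~foreground, self.alpha
            self.mean[bg] = a * frame[bg] + (1 - a) * self.mean[bg]
            self.sigma[bg] = a * np.abs(frame[bg] - self.mean[bg]) + (1 - a) * self.sigma[bg]
            return foreground

    if __name__ == "__main__":
        frames = np.random.rand(5, 240, 320) * 10 + 100      # synthetic static scene
        frames[-1, 100:120, 150:170] += 80                   # a "moving object" appears
        model = AdaptiveBackground(frames[0])
        for f in frames[1:]:
            mask = model.apply(f)
        print("foreground pixels:", int(mask.sum()))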

"Moving" pixels are aggregated using a connected component approach so that individual target regions can be extracted. Transient moving objects cause short-term changes to the image stream that are not included in the background model but are continually tracked, whereas more permanent changes are (after 3τ) absorbed into the background (see Figure 7).

Figure 7: Example of moving target detection by dynamic background subtraction.

While this class of detection methods is inapplicable to video streams from moving cameras, it can be employed in a "step and stare" mode in which the system predicts the position of a target and points the camera so as to reacquire it. Most background subtraction schemes require a period of time in which the scene remains static so that a background model can be built. This is unacceptable in a video surveillance application where motion detection must be available to the system immediately after a "step and stare" move. In this implementation, temporal differencing (three-frame differencing) is used as a stop-gap measure until the background model has stabilized. While the quality of the moving target detection is diminished using this method, it is assumed that since the system is in a tracking mode, it has some notion of which target it is looking for, and where it may be located, so this inferior motion detection scheme will be adequate for the system to reacquire the target.
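The stop-gap detector can be sketched as follows; the threshold and the synthetic frames in the example are illustrative assumptions.

    # Sketch of three-frame differencing: pixels that differ from BOTH of the
    # two previous frames are flagged, which suppresses the "ghost" left behind
    # at the object's old position.
    import numpy as np

    def three_frame_difference(prev2, prev1, curr, thresh=25):
        d1 = np.abs(curr.astype(np.int16) - prev1.astype(np.int16)) > thresh
        d2 = np.abs(curr.astype(np.int16) - prev2.astype(np.int16)) > thresh
        return d1 & d2

    if __name__ == "__main__":
        h, w = 240, 320
        f0 = np.zeros((h, w), np.uint8)
        f1 = f0.copy(); f1[100:120, 100:120] = 200           # target at old position
        f2 = f0.copy(); f2[100:120, 140:160] = 200           # target at new position
        mask = three_frame_difference(f0, f1, f2)
        print("moving pixels:", int(mask.sum()))             # only the new position fires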

The MTD algorithm is prone to three types of error: incomplete extraction of a moving object; erroneous extraction of non-moving pixels; and legitimate extraction of illegitimate targets (such as trees blowing in the wind). Incomplete targets are partially reconstructed by blob clustering and morphological dilation (Figure 8). Erroneously extracted "noise" is removed using a size filter whereby blobs below a certain critical size are ignored. Illegitimate targets must be removed by other means such as temporal consistency and domain knowledge. This is the purview of the target tracking algorithm.

Figure 8: Target pre-processing. A moving target region is morphologically dilated (twice), eroded, and then its border is extracted.
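A sketch of this clean-up stage, using generic morphological operations from SciPy rather than the testbed's actual implementation (the structuring element and size threshold are assumptions):

    # Sketch of the target pre-processing of Figure 8: dilate the motion mask
    # twice, erode once, label connected components, and discard blobs below a
    # critical size.
    import numpy as np
    from scipy import ndimage

    def extract_blobs(motion_mask, min_size=50):
        cleaned = ndimage.binary_dilation(motion_mask, iterations=2)
        cleaned = ndimage.binary_erosion(cleaned)
        labels, n = ndimage.label(cleaned)
        blobs = []
        for i in range(1, n + 1):
            ys, xs = np.nonzero(labels == i)
            if ys.size < min_size:                   # size filter: drop "noise" blobs
                continue
            blobs.append({"size": int(ys.size),
                          "centroid": (float(ys.mean()), float(xs.mean())),
                          "bbox": (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max()))})
        return blobs

    if __name__ == "__main__":
        mask = np.zeros((240, 320), bool)
        mask[50:90, 60:100] = True                   # a legitimate target
        mask[10, 10] = True                          # an isolated noise pixel
        for b in extract_blobs(mask):
            print(b["size"], b["centroid"])          # only the large blob survives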

3.2 Target Tracking

One main purpose of the system is to build a temporal model of activity. To do this, individual objects must be tracked over time. The first step in this process is to take the blobs generated by motion detection and match them frame to frame.

Many systems for target tracking are based on Kalman filters, but as pointed out by [Isard and Blake, 1996], these are of limited use because they are based on unimodal Gaussian densities and cannot support simultaneous alternative motion hypotheses. A few other approaches have been devised: Isard and Blake [Isard and Blake, 1996] present a new stochastic algorithm for robust tracking which is superior to previous Kalman filter based approaches, and Bregler [Bregler, 1997] presents a probabilistic decomposition of human dynamics to learn and recognise human beings (or their gaits) in video sequences.

The IFD testbed system uses a much simpler approach based on a frame-to-frame matching cost function. A record of each blob is kept with the following information:

- trajectory (position p(t) and velocity v(t) as functions of time) in image coordinates,

- associated camera calibration parameters, so the target's trajectory can be normalized to an absolute coordinate system (p̂(t) and v̂(t)),

- the "blob" data as an image chip,

- "blob" size s and centroid c,

- color histogram h of the "blob".

First, the position and velocity of target T_i at the last time step t_last are used to predict its position at the current time t_now:

    \hat{p}_i(t_{now}) \approx \hat{p}_i(t_{last}) + \hat{v}_i(t_{last}) (t_{now} - t_{last})        (3)

Using this prediction, a matching cost can be determined between a known target T_i and a current moving "blob" R_j:

    C(T_i, R_j) = f(|\hat{p}_i - \hat{p}_j|, |s_i - s_j|, |c_i - c_j|, |h_i - h_j|)        (4)

Matched targets are then maintained over time. This method will fail if there are occlusions. For this reason, significant targets (chosen by either the user or the system) are tracked using a combination of the cost function and adaptive template matching [Lipton et al., 1998]. Recent results from the system are shown in Figure 9.
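A minimal sketch of this matching step is shown below; the weights in the cost function are illustrative assumptions rather than the testbed's tuned values.

    # Sketch of the frame-to-frame matching of equations (3)-(4): predict each
    # tracked target forward, then score candidate blobs by a weighted sum of
    # position, size, centroid and colour-histogram differences.
    import numpy as np

    def predict(position, velocity, dt):
        """Equation (3): constant-velocity prediction of the target position."""
        return position + velocity * dt

    def match_cost(target, blob, w=(1.0, 0.01, 0.5, 10.0)):
        """Equation (4): cost between a known target and a candidate blob."""
        dp = np.linalg.norm(predict(target["p"], target["v"], target["dt"]) - blob["p"])
        ds = abs(target["size"] - blob["size"])
        dc = np.linalg.norm(target["centroid"] - blob["centroid"])
        dh = np.linalg.norm(target["hist"] - blob["hist"])   # histograms assumed normalised
        return w[0] * dp + w[1] * ds + w[2] * dc + w[3] * dh

    if __name__ == "__main__":
        target = {"p": np.array([100.0, 50.0]), "v": np.array([5.0, 0.0]), "dt": 1.0,
                  "size": 400, "centroid": np.array([100.0, 50.0]), "hist": np.ones(8) / 8}
        blobs = [{"p": np.array([104.0, 51.0]), "size": 390,
                  "centroid": np.array([104.0, 51.0]), "hist": np.ones(8) / 8},
                 {"p": np.array([200.0, 200.0]), "size": 90,
                  "centroid": np.array([200.0, 200.0]), "hist": np.ones(8) / 8}]
        print(min(blobs, key=lambda b: match_cost(target, b))["p"])   # nearby blob wins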

3.3 Target Classification

The ultimate goal of the VSAM effort is to be able to identify individual entities, such as the "FedEx truck", the "4:15pm bus to Oakland" and "Fred Smith". As a first step, entities are classified into specific class groupings such as "humans" and "vehicles". An initial effort in this work [Lipton et al., 1998] used view-independent visual properties to classify entities into three classes: humans; vehicles; and clutter. In the rural environment of the Bushy Run demo, this is an adequate first step; however, in the less constrained urban environment of Demo II, a more robust method is employed with a larger number of entity classes.

Figure 9: Recent results of moving entity detection and tracking, showing detected objects and trajectories overlaid on original video imagery. Note that tracking persists even when targets are temporarily occluded or motionless.

As mentioned in Section 3.1, two different MTD algorithms might be employed depending on the stability of the background scene. The quality of the motion detection extracted by three-frame differencing is inferior to that of dynamic background subtraction, so a different classification metric must be employed in each case.

In the usual case, when entities are accurately extracted by background subtraction, a neural network approach is used (Figure 10). The neural network is a standard three-layer network which uses a back-propagation algorithm for hierarchical learning. Inputs to the network are a mixture of image-based and scene-based entity parameters: dispersedness (perimeter^2/area, in pixels); area (in pixels); apparent aspect ratio; and camera zoom. The network outputs three classes: human; vehicle; or human group.

When teaching the network that an input entity is a human, all outputs are set to 0.0 except for "human", which is set to 1.0. Other classes are trained similarly. If the input does not fit any of the classes, such as a tree blowing in the wind, all outputs are set to 0.0.

Results from the neural network are interpreted as follows:

    if (output > THRESHOLD)
        classification = class with maximum NN output
    else
        classification = REJECT

Figure 10: Neural network approach to target classification. Four inputs (dispersedness, area, aspect ratio, and camera zoom magnification) feed a 16-unit hidden layer and three outputs (single human, multiple human, vehicle); the example teach patterns shown are for a single human (1.0, 0.0, 0.0) and a rejected target (0.0, 0.0, 0.0).

    Class          Samples   % Classified
    Human            430         99.5
    Human group       96         88.5
    Vehicle          508         99.4
    False alarms      48         64.5
    Total           1082         96.9

Table 1: Results for classification algorithms on VSAM IFD data.

In the case when the background model is unstable and three-frame differencing is used to detect moving targets, a different classification criterion is used. Given the geolocation of the entity, its actual width w and height h in meters can be determined. Then a heuristic is applied to these values:

    w < 1.1,          h in [0.5, 2.5]   =>  human
    w in [1.1, 2.2],  h in [0.5, 2.5]   =>  human group
    w in [2.2, 20],   h in [0.7, 4.5]   =>  vehicle
    otherwise                           =>  reject                       (5)
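The heuristic transcribes directly into code, for example:

    # Direct transcription of heuristic (5): classify an entity from its
    # estimated physical width w and height h (metres).
    def classify_by_size(w, h):
        if w < 1.1 and 0.5 <= h <= 2.5:
            return "human"
        if 1.1 <= w <= 2.2 and 0.5 <= h <= 2.5:
            return "human group"
        if 2.2 <= w <= 20.0 and 0.7 <= h <= 4.5:
            return "vehicle"
        return "reject"

    if __name__ == "__main__":
        print(classify_by_size(0.6, 1.8))    # human
        print(classify_by_size(1.8, 1.7))    # human group
        print(classify_by_size(4.5, 1.5))    # vehicle
        print(classify_by_size(0.3, 0.1))    # reject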

The results of this classification scheme are summarized in Table 1.

These classification metrics are effective for single images. One of the advantages of video is its temporal component. To exploit this, classification is performed on every entity at every frame, and the results are accumulated in a histogram. At each time step, the most likely class label is then chosen as the entity classification, as described in [Lipton et al., 1998].
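A sketch of this temporal voting scheme follows; the minimum-evidence threshold is an assumption.

    # Sketch of temporal classification smoothing: per-frame labels are
    # accumulated in a histogram for each tracked entity, and the most frequent
    # label is reported as the entity's class.
    from collections import Counter

    class ClassHistogram:
        def __init__(self, min_votes=5):
            self.votes = Counter()
            self.min_votes = min_votes

        def update(self, frame_label):
            if frame_label != "reject":
                self.votes[frame_label] += 1

        def label(self):
            if sum(self.votes.values()) < self.min_votes:
                return "unknown"
            return self.votes.most_common(1)[0][0]

    if __name__ == "__main__":
        h = ClassHistogram()
        for lab in ["human", "human", "reject", "vehicle", "human", "human", "human"]:
            h.update(lab)
        print(h.label())                     # "human": a single mislabel is voted down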

3.4 Target Recognition

One of the key features of the VSAM IFD testbed system is the ability to re-acquire a specific target in a video image. It may be necessary to do this from a single camera when the target has temporarily been lost, or between two different cameras when performing handoff. Obviously viewpoint-specific appearance criteria are not useful, since the new view of the target may be significantly different from the previous view. Therefore, recognition features are needed that are independent of viewpoint.

In this system, two such criteria are used: absolute trajectory, and color histogram. The first is computed by geolocating targets using viewing ray intersection with a scene model (Section 4.3), and the second is determined in a normalized RGB color space.

The first step in recognition is to predict the position in which the target is likely to appear. This is done using Equation (3). After this, candidate motion regions are tested by applying a matching cost function. The form of the cost function is similar to Equation (4), but with fewer parameters:

    C_{reacquire} = f(|\hat{p}_i - \hat{p}_j|, |h_i - h_j|)        (6)
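A sketch of the re-acquisition test is given below, assuming a simple normalised-RGB chromaticity histogram; the bin count and weights are illustrative.

    # Sketch of equation (6): candidate regions are scored by distance from the
    # predicted geolocation plus distance between colour histograms computed in
    # a normalised RGB space.
    import numpy as np

    def normalised_rgb_histogram(pixels, bins=8):
        """Histogram of (r, g) chromaticity, r = R/(R+G+B), g = G/(R+G+B)."""
        rgb = pixels.reshape(-1, 3).astype(np.float64)
        chroma = rgb[:, :2] / (rgb.sum(axis=1, keepdims=True) + 1e-6)
        hist, _, _ = np.histogram2d(chroma[:, 0], chroma[:, 1],
                                    bins=bins, range=[[0, 1], [0, 1]])
        return hist.ravel() / max(hist.sum(), 1.0)

    def reacquire_cost(pred_pos, hist, cand_pos, cand_hist, w_pos=1.0, w_hist=50.0):
        return (w_pos * np.linalg.norm(np.asarray(pred_pos) - np.asarray(cand_pos))
                + w_hist * np.linalg.norm(hist - cand_hist))

    if __name__ == "__main__":
        patch = np.random.randint(0, 255, (20, 20, 3), np.uint8)
        h = normalised_rgb_histogram(patch)
        print(reacquire_cost((10.0, 5.0), h, (11.0, 5.5), normalised_rgb_histogram(patch)))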

3.5 Activity Analysis

Using video in machine understanding has recently become a significant research topic. One of the more active areas is activity understanding from video imagery [Kanade et al., 1997]. Understanding activities involves being able to detect and classify targets of interest and analyze what they are doing. Human motion analysis is one such research area. There have been several good human detection schemes, such as [Oren et al., 1997], which use static imagery. But detecting and analyzing human motion in real time from video imagery has only recently become viable with algorithms like Pfinder [Wren et al., 1997] and W4 [Haritaoglu et al., 1998]. These algorithms represent a good first step to the problem of recognizing and analyzing humans, but they still have some drawbacks. In general, they work by detecting features (such as hands, feet and head), tracking them, and fitting them to some a priori human model such as the cardboard model of Ju et al. [Ju et al., 1996].

The VSAM IFD proposes the use of the "star" skeletonization procedure for analyzing the motion of targets [Fujiyoshi and Lipton, 1998], particularly human targets. The notion is that a simple form of skeletonization, which extracts only the broad internal motion features of a target, can be employed to analyze its motion. This method provides a simple, real-time, robust way of detecting extremal points on the boundary of the target to produce a "star" skeleton. The "star" skeleton consists of the centroid of an entity and all of the local extremal points that can be recovered when traversing the boundary of the entity's image (Figure 11).

Figure 11: The boundary is "unwrapped" as a distance function d(i) from the centroid. This function is then smoothed by low-pass filtering (via the inverse DFT), and extremal points of the smoothed function are extracted to form the "star" skeleton of the shape.

Figure 12 shows the skeletons of various objects. It is clear that while this form of skeletonization provides a sparse set of points, it can nevertheless be used to classify and analyze the motion of various different types of entity.
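A compact sketch of the star skeletonization is shown below, using OpenCV only for boundary extraction; the low-pass cutoff fraction and the synthetic test shape are assumptions.

    # Sketch of "star" skeletonization: unwrap the target boundary as a
    # distance-from-centroid function d(i), low-pass filter it in the Fourier
    # domain, and take local maxima of the smoothed function as extremal points.
    import cv2
    import numpy as np

    def star_skeleton(mask, keep_fraction=0.05):
        contours = cv2.findContours(mask.astype(np.uint8),
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
        boundary = max(contours, key=len).reshape(-1, 2).astype(np.float64)  # (x, y)
        centroid = boundary.mean(axis=0)          # boundary centroid as a stand-in
        d = np.linalg.norm(boundary - centroid, axis=1)      # distance function d(i)

        D = np.fft.rfft(d)                        # low-pass filter d(i) via the DFT
        D[max(2, int(len(D) * keep_fraction)):] = 0
        d_smooth = np.fft.irfft(D, n=len(d))

        prev, nxt = np.roll(d_smooth, 1), np.roll(d_smooth, -1)
        peaks = np.where((d_smooth > prev) & (d_smooth >= nxt))[0]
        return centroid, boundary[peaks]          # extremal points of the boundary

    if __name__ == "__main__":
        mask = np.zeros((200, 200), np.uint8)
        cv2.rectangle(mask, (90, 40), (110, 160), 255, -1)   # crude "torso"
        cv2.rectangle(mask, (60, 60), (140, 75), 255, -1)    # crude "arms"
        c, extremes = star_skeleton(mask)
        print("centroid", c, "extremal points", len(extremes))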

3.6 Human Motion Analysis

Figure 12: Skeletonization of different moving targets: (a) human, (b) vehicle, (c) polar bear, showing the video image, the motion detection result, and the extracted skeleton. It is clear that the structure and rigidity of the skeleton are significant in analyzing target motion.

One technique often used to analyze the motion or gait of an individual target is the cyclic motion of skeletal components [Tsai et al., 1994]. However, in this implementation individual joint positions cannot be determined in real time, so a more fundamental cyclic analysis must be performed. Another cue to the gait of the target is its posture. Using only a metric based on the "star" skeleton, it is possible to determine the posture of a moving human. Figure 13 shows how these two properties are simply extracted from the skeleton. The uppermost skeleton segment is assumed to represent the torso, and the lower left segment is assumed to represent a leg, which can be analyzed for cyclic motion.

Figure 13: Determination of skeleton features. (a) θ is the angle the left cyclic point (leg) makes with the vertical, and (b) φ is the angle the torso makes with the vertical.

Figure 14 shows human target skeleton motion sequences for walking and running and the values of θ_n for the cyclic point. These data were acquired in real time from a video stream with a frame rate of 8 Hz.

Figure 14: Skeleton motion sequences for (a) a walking person and (b) a running person, with the corresponding leg angle θ for the (c) walking and (d) running sequences and torso angle |φ| for the (e) walking and (f) running sequences. Clearly, the periodic motion of θ_n provides cues to the target's motion, as does the mean value of φ_n.

Comparing the average values of φ_n in Figures 14(e)-(f) shows that the posture of a running target can easily be distinguished from that of a walking one using the angle of the torso segment as a guide. In addition, the frequency of cyclic motion of the leg segments can be used to determine human activity.
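A sketch of how these two cues could be computed and thresholded is given below; the torso-angle threshold and the synthetic angle sequences are assumptions, not values measured from the testbed.

    # Sketch of gait analysis from skeleton features: the dominant frequency of
    # the cyclic leg angle theta_n and the mean magnitude of the torso angle
    # phi_n are the cues; only the torso cue is thresholded here.
    import numpy as np

    def gait_cues(theta, phi, frame_rate=8.0):
        """theta, phi: per-frame leg and torso angles (radians)."""
        theta = np.asarray(theta) - np.mean(theta)
        spectrum = np.abs(np.fft.rfft(theta))
        freqs = np.fft.rfftfreq(len(theta), d=1.0 / frame_rate)
        dominant_hz = freqs[1:][np.argmax(spectrum[1:])]     # skip the DC bin
        return dominant_hz, float(np.mean(np.abs(phi)))

    def classify_gait(theta, phi, frame_rate=8.0, torso_thresh=0.15):
        _, mean_torso = gait_cues(theta, phi, frame_rate)
        return "running" if mean_torso > torso_thresh else "walking"

    if __name__ == "__main__":
        t = np.arange(40) / 8.0
        walk = (0.4 * np.sin(2 * np.pi * 1.0 * t), 0.05 * np.ones_like(t))
        run = (0.7 * np.sin(2 * np.pi * 1.5 * t), 0.30 * np.ones_like(t))
        print(classify_gait(*walk), classify_gait(*run))     # walking running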

Another approach to classifying a moving object is to determine whether it is rigid or non-rigid by examining changes in its appearance over multiple frames [Wixson and Selinger, 1998]. This is most useful for distinguishing vehicles from humans and animals.

3.7 Airborne Surveillance

Fixed ground-sensor placement is fine for defensive monitoring of static facilities such as depots, warehouses or parking lots. In those cases, sensor placement can be planned in advance to get maximum usage of limited VSAM resources. However, the battlefield is a large and constantly shifting piece of real estate, and it may be necessary to move sensors around in order to maximize their utility as the battle unfolds. While the airborne sensor platform directly addresses this concern, the self-motion of the aircraft itself introduces challenging video understanding issues.

3.7.1 Airborne Target Tracking

Target detection and tracking is a difficult problem from a moving sensor platform. The difficulty arises from trying to detect small blocks of moving pixels representing independently moving target objects when the whole image is shifting due to self-motion (also known as host motion). The key to our success with the airborne sensor is characterization and removal of host motion from the video sequence using the Pyramid Vision Technologies PVT-200 real-time video processor system. As new video frames stream in, the PVT processor registers and warps each new frame to a chosen reference image, resulting in a cancellation of pixel movement caused by host motion, and leading to a "stabilized" display that appears motionless for several seconds. Airborne target detection and tracking is then performed using three-frame differencing after using image alignment to register frame I_{t-2} to I_t and frame I_{t-1} to I_t. This registration is performed at 30 frames/sec.
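A sketch of this register-then-difference pipeline is shown below, substituting generic OpenCV feature-based affine alignment for the PVT-200's dedicated real-time registration; frames are assumed to be 8-bit grayscale images.

    # Sketch of airborne moving target detection: warp I(t-2) and I(t-1) onto
    # I(t) to cancel host motion, then apply three-frame differencing to the
    # registered frames.
    import cv2
    import numpy as np

    def align(src, ref):
        """Warp src onto ref using an affine model estimated from ORB matches."""
        orb = cv2.ORB_create(500)
        k1, d1 = orb.detectAndCompute(src, None)
        k2, d2 = orb.detectAndCompute(ref, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
        pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
        M, _ = cv2.estimateAffinePartial2D(pts1, pts2, method=cv2.RANSAC)
        return cv2.warpAffine(src, M, (ref.shape[1], ref.shape[0]))

    def airborne_moving_targets(f_prev2, f_prev1, f_curr, thresh=30):
        a2, a1 = align(f_prev2, f_curr), align(f_prev1, f_curr)
        d1 = cv2.absdiff(f_curr, a1) > thresh
        d2 = cv2.absdiff(f_curr, a2) > thresh
        return d1 & d2                      # residual motion after cancellation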

During stabilization, the problem of moving target detection from a moving platform is ideally reduced to performing VSAM from a stationary camera, in the sense that moving objects are readily apparent as moving pixels in the image. However, under real circumstances some of the remaining residual pixel motion is due to parallax caused by significant 3D scene structure. This is a subject of ongoing research.


3.7.2 Camera Fixation and Aiming

It is well known that human operators fatigue rapidly when controlling cameras on moving airborne and ground platforms. This is because they must continually adjust the turret to keep it locked on a stationary or moving target. Additionally, the video is continuously moving, reflecting the ego-motion of the camera. The combination of these factors often leads to operator confusion and nausea. We have built upon image alignment techniques [Bergen et al., 1992, Hansen et al., 1994] to stabilize the view from the camera turret, and used the same techniques to automate the camera control, thereby significantly reducing the strain on the operator. In particular, we use real-time image alignment to keep the camera locked on a stationary or moving point in the scene, and to aim the camera at a known geodetic coordinate for which reference imagery is available. More details can be found in [Wixson et al., 1998], this proceedings.

Figure 15 shows the performance of the stabilization/fixation algorithm on two ground points as the aircraft traverses an approximate ellipse over them. The field of view in these examples is 3 degrees, and the aircraft took approximately 3 minutes to complete each orbit.

3.7.3 Air Sensor Multi-Tasking

Occasionally, a single camera resource must be used to track multiple moving objects, not all of which fit within a single field of view. This problem is particularly relevant for high-altitude air platforms that must have a narrow field of view in order to see ground targets at a reasonable resolution. Sensor multi-tasking is employed to switch the field of view periodically between two (or more) target areas that are being monitored. This process is illustrated in Figure 16 and described in detail in [Wixson et al., 1998].

4 Site Modeling

We have used site models in the VSAM program both to improve the human-computer interface and to enable various tracking capabilities such as inferring target positions and camera visibility. These models have included automatically generated image mosaics, USGS orthophotos, Digital Elevation Models (DEMs), and CAD models in VRML.

Figure 15: Fixation on target point A and on target point B. The images shown are taken 0, 45, 90 and 135 seconds after fixation was started. The large center cross-hairs indicate the center of the stabilized image, i.e. the point of fixation.

The OCU site model contains VSAM-relevant information about the area being monitored. This includes both geometric and photometric information about the scene, represented using a combination of image and symbolic data. Site model representations have intentionally been kept as simple as possible, with an eye towards efficiently supporting some specific VSAM capabilities:

Figure 16: Footprints of the airborne sensor being autonomously multi-tasked between three disparate geodetic scene coordinates.

- The site model must primarily support OCU workstation graphics that allow the operator to visualize the whole site and quickly comprehend geometric relationships between sensors, targets, and scene features.

- The site model provides a geometric context for VSAM. For example, we might directly task a sensor to monitor the door of a building for people coming in and out, or specify that vehicles should appear on roads.

- A 3D site model supports visibility analysis (predicting what portions of the scene are visible from what sensors) so that sensors can be efficiently tasked.

- An accurate 3D site model supports target geopositioning via intersection of viewing rays with the terrain.

4.1 Coordinate Systems

Two geospatial site coordinate systems are used interchangeably within the VSAM testbed. The WGS84 geodetic coordinate system provides a reference frame that is standard, unambiguous and global (in the true sense of the word). Unfortunately, even simple computations such as the distance between two points become complicated functions of latitude, longitude and elevation. For this reason, geometric processing is performed within a site-specific Local Vertical Coordinate System (LVCS) [ASP, 1980]. An LVCS is a Cartesian system oriented so that the positive X axis points east, positive Y points true north, and positive Z points up. All that is needed to completely specify an LVCS is the 3D geodetic coordinate of its origin point. Conversion between geodetic and LVCS coordinates is straightforward, so that each can be used as appropriate to a task.
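For concreteness, the following sketch converts WGS84 geodetic coordinates to an LVCS by way of Earth-centred coordinates; the origin shown in the example is illustrative.

    # Sketch of geodetic-to-LVCS conversion: WGS84 (lat, lon, height) is mapped
    # to Earth-centred Cartesian coordinates and then rotated into a local
    # east-north-up frame anchored at the LVCS origin.
    import numpy as np

    A = 6378137.0                         # WGS84 semi-major axis (m)
    F = 1.0 / 298.257223563               # WGS84 flattening
    E2 = F * (2.0 - F)                    # first eccentricity squared

    def geodetic_to_ecef(lat_deg, lon_deg, h):
        lat, lon = np.radians(lat_deg), np.radians(lon_deg)
        n = A / np.sqrt(1.0 - E2 * np.sin(lat) ** 2)
        return np.array([(n + h) * np.cos(lat) * np.cos(lon),
                         (n + h) * np.cos(lat) * np.sin(lon),
                         (n * (1.0 - E2) + h) * np.sin(lat)])

    def geodetic_to_lvcs(lat_deg, lon_deg, h, origin):
        """LVCS axes: X east, Y north, Z up; origin = (lat, lon, h)."""
        lat0, lon0 = np.radians(origin[0]), np.radians(origin[1])
        d = geodetic_to_ecef(lat_deg, lon_deg, h) - geodetic_to_ecef(*origin)
        east = np.array([-np.sin(lon0), np.cos(lon0), 0.0])
        north = np.array([-np.sin(lat0) * np.cos(lon0),
                          -np.sin(lat0) * np.sin(lon0), np.cos(lat0)])
        up = np.array([np.cos(lat0) * np.cos(lon0),
                       np.cos(lat0) * np.sin(lon0), np.sin(lat0)])
        return np.array([east @ d, north @ d, up @ d])

    if __name__ == "__main__":
        origin = (40.4433, -79.9436, 300.0)      # roughly the CMU campus (illustrative)
        print(geodetic_to_lvcs(40.4450, -79.9400, 310.0, origin))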

4.2 Model Representations

Figure 17 illustrates the wide variety of site model representations that have been used in either the 1997 IFD VSAM demo system or the 1998 testbed system.

A) USGS orthophoto. The United States Geological Survey (USGS) produces several digital mapping products that can be used to create an initial site model. These include:

- Digital Orthophoto Quarter Quad (DOQQ): a nadir (down-looking) image of the site as it would look under orthographic projection (Figure 17A). The result is an image where scene features appear in their correct horizontal positions.

- Digital Elevation Model (DEM): an image whose pixel values denote scene elevations at the corresponding horizontal positions. Each grid cell of the USGS DEM shown encompasses a 30-meter square area.

- Digital Raster Graphic (DRG): a digital version of the popular USGS topographic maps.

- Digital Line Graph (DLG): vector representations of public roadways and other cartographic features.

Many of these products can be ordered directly from the USGS EROS Data Center web site, located at URL http://edcwww.cr.usgs.gov/. The ability to use existing mapping products from the USGS or the National Imagery and Mapping Agency (NIMA) to bootstrap a VSAM site model demonstrates that rapid deployment of VSAM systems to monitor trouble spots around the globe is a feasible goal.

B) Custom DEM. The Robotics Institute autonomous helicopter group has mounted a high-precision laser range finder onto a remote-control Yamaha helicopter. This was used to create a high-resolution (half-meter grid spacing) DEM of the Bushy Run site for VSAM Demo I (Figure 17B). Raw range returns were collected with respect to known helicopter position and orientation (using on-board altimetry data) to form a cloud of points representing returns from surfaces in the scene. These points were converted into a DEM by projecting them into LVCS horizontal-coordinate bins and computing the mean and standard deviation of the height values in each bin.

C) Mosaics. A central challenge in surveillance is how to present sensor information to a human operator [Miller and Amidi, 1998]. The relatively narrow field of view presented by each sensor makes it very difficult for the operator to maintain a sense of context that enables him or her to know just what lies outside the camera's immediate image. Image mosaics from moving cameras overcome this problem by providing extended views of regions swept over by the camera. Figure 17C displays an aerial mosaic of the Demo I Bushy Run site. The video sequence was obtained by flying over the demo site while panning the camera turret back and forth and keeping the camera tilt constant [Hansen et al., 1994, Sawhney and Kumar, 1997, Sawhney et al., 1998]. The VSAM IFD team also demonstrated coarse registration of this mosaic with a USGS orthophoto using a projective warp to determine an approximate mapping from mosaic pixels to geographic coordinates. It is feasible that this technology could lead to automated methods for updating existing orthophoto information using fresh imagery from a recent fly-through. For example, seasonal variations such as fresh snowfall (as in the case of VSAM Demo I) can be integrated into the orthophoto.

D) VRML models. Figure 17D shows a VRML model of one of the Bushy Run buildings and its surrounding terrain. This model was created by the K2T company using the factorization method [Tomasi and Kanade, 1992] applied to aerial and ground-based video sequences. Another use of VRML models in the VSAM system is to display spherical mosaics with the user immersed at the location of the focal point (more on this below).

E) Compact Terrain Data Base (CTDB). Recently, we have begun to use the Compact Terrain Data Base (CTDB) to represent full site models. The CTDB was originally designed to represent large expanses of terrain within the context of advanced distributed simulation, and has been optimized to efficiently answer geometric queries such as finding the elevation at a point in real time. Terrain can be represented as either a grid of elevations or as a Triangulated Irregular Network (TIN), and hybrid databases containing both representations are allowed. The CTDB also represents relevant cartographic features on top of the terrain skin, including buildings, roads, bodies of water, and tree canopies. Figure 17E shows a small portion of the Schenley Park / CMU campus CTDB currently being generated for the 1998 VSAM demo. In addition to determining object geolocation by intersecting viewing rays with the ground, we are using the CTDB to perform occlusion analysis by determining inter-visibility of one point from the viewpoint of another. Another important benefit of using the CTDB as a site model representation for VSAM processing is that it allows us to easily interface with the synthetic environment tools ModSAF and ModStealth.

Figure 17: A variety of site model representations have been used in the VSAM IFD testbed system: A) USGS orthophoto; B) custom DEM; C) aerial mosaic; D) VRML model; E) CTDB site model; and F) spherical representations.

F) Spherical representations. Everything that can be seen from a stationary camera can be represented on the surface of a viewing sphere [Adelson and Bergen, 1991]. This is true even if the camera is allowed to pan and tilt about the focal point, and to zoom in and out: the image at any given (pan, tilt, zoom) setting is essentially a discrete sample of the bundle of light rays impinging on the camera's focal point. This year, the VSAM IFD team is making use of this fact to design camera-specific representations of the geometry and photometry of the scene (Figure 17F). Spherical lookup tables are being built for each fixed-mount SPU to precompile and store the 3D locations and surface material types of the points of intersection of that camera's viewing rays with the CTDB site model. Spherical mosaics are being produced in real time to provide an extended view of what can potentially be seen from the camera, again to provide a better sense of context to the human observer. Figure 18 displays a spherical mosaic constructed by panning and tilting a stationary camera mounted on a rooftop at the VSAM Demo II site. This representation can be displayed using a VRML or RealSpace viewer with the observer situated at the center of the sphere for a realistic surround-video effect. The ultimate goal is to display extracted moving entities in real time superimposed on this extended spherical background.

4.3 Model-based Geolocation

Figure 18: Spherical mosaic from a camera at the VSAM Demo II site.

One example of model-based VSAM is computation of target geolocation from a monocular view. We believe the key to coherently integrating a large number of target hypotheses from multiple widely spaced sensors is computation of target spatial geolocation. In regions where multiple sensor viewpoints overlap, object trajectories can be determined very accurately by wide-baseline stereo triangulation. However, regions of the scene that can be simultaneously viewed by multiple sensors are likely to be a small percentage of the total area of regard in real outdoor surveillance applications, where it is desirable to maximize coverage of a large area given finite sensor resources.

Figure 19: Estimating object geolocations by intersecting target viewing rays with a terrain model.

Determining target trajectories from a single sensor requires domain constraints, in this case the assumption that the object is in contact with the terrain. This contact location is estimated by passing a viewing ray through the bottom of the object in the image and intersecting it with a model representing the terrain (see Figure 19). Sequences of location estimates over time are then assembled into consistent object trajectories. Previous uses of the ray intersection technique for object localization in surveillance research have been restricted to small areas of planar terrain, where the relation between image pixels and terrain locations is a simple 2D homography [Bradshaw et al., 1997, Flinchbaugh and Bannon, 1994, Koller et al., 1993]. This has the benefit that no camera calibration is required to determine the back-projection of an image point onto the scene plane, provided the mappings of at least four coplanar scene points are known beforehand. However, the VSAM testbed is designed for much larger scene areas that may contain significantly varied terrain. We perform geolocation using ray intersection with full terrain models provided by either digital elevation maps or the compact terrain database. See [Collins et al., 1998] in this proceedings for more details.
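A sketch of the ray-terrain intersection is given below, assuming a simple gridded DEM and fixed-step ray marching; the testbed instead queries a full DEM or CTDB site model.

    # Sketch of monocular geolocation: march along the viewing ray through the
    # bottom of the target's image footprint and report the first point at or
    # below the terrain elevation interpolated from a DEM.
    import numpy as np

    def intersect_ray_with_terrain(origin, direction, dem, cell_size,
                                   step=1.0, max_range=5000.0):
        """origin, direction: 3D ray in LVCS metres; dem[row, col] gives the
        elevation at (x, y) = (col * cell_size, row * cell_size)."""
        direction = np.asarray(direction, float)
        direction /= np.linalg.norm(direction)
        p = np.asarray(origin, float)
        for _ in range(int(max_range / step)):
            p = p + step * direction
            col, row = p[0] / cell_size, p[1] / cell_size
            if not (0 <= row < dem.shape[0] - 1 and 0 <= col < dem.shape[1] - 1):
                return None                       # ray left the modelled area
            r0, c0 = int(row), int(col)
            fr, fc = row - r0, col - c0           # bilinear elevation interpolation
            z = ((1 - fr) * (1 - fc) * dem[r0, c0] + (1 - fr) * fc * dem[r0, c0 + 1]
                 + fr * (1 - fc) * dem[r0 + 1, c0] + fr * fc * dem[r0 + 1, c0 + 1])
            if p[2] <= z:
                return np.array([p[0], p[1], z])  # first contact with the terrain
        return None

    if __name__ == "__main__":
        dem = np.zeros((100, 100)) + 250.0        # flat terrain at 250 m elevation
        print(intersect_ray_with_terrain((100.0, 100.0, 300.0),
                                         (1.0, 0.5, -0.2), dem, cell_size=30.0))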

5 Human-Computer Interface

Keeping track of people, vehicles, and their interactions, over a chaotic area such as the battlefield, is a difficult task. The commander obviously should not be looking at two dozen screens showing raw video output: that amount of sensory overload virtually guarantees that information will be ignored, and it would require a prohibitive amount of transmission bandwidth. Our approach is to provide an interactive, graphical visualization of the battlefield by using VSAM technology to automatically place dynamic agents representing people and vehicles into a synthetic view of the environment.

This approach has the benefit that visualization of the target is no longer tied to the original resolution and viewpoint of the video sensor, since a synthetic replay of the dynamic events can be constructed using high-resolution, texture-mapped graphics, from any perspective. Particularly striking is the amount of data compression that can be achieved by transmitting only symbolic georegistered target information back to the operator control unit instead of raw video data. Currently, we can process NTSC color imagery with a frame size of 320x240 pixels at 10 frames per second on a Pentium II PC, so that data streams into the system through each sensor at a rate of roughly 2.3 MB per second per sensor. After VSAM processing, detected target hypotheses contain information about object type, target location and velocity, as well as measurement statistics such as a time stamp and a description of the sensor (current pan, tilt, and zoom, for example). Each target data packet takes up roughly 50 bytes. If a sensor tracks 3 targets for one second at 10 frames per second, it ends up transmitting 1500 bytes back to the OCU, well over a thousandfold reduction in data bandwidth.

5.1 Benning MOUT Site Experiment

Ultimately, the key to comprehending large-scale, multi-agent events is a full, 3D immersive visualization that allows the human operator to fly at will through the environment to view dynamic events unfolding in real time from any viewpoint. A geographically accurate 3D model-based visualization can give the commander a comprehensive overview of the battlefield by displaying people and vehicles dynamically interacting in their proper spatial relationships to each other and to 3D cartographic features such as buildings and roads.

We envision a graphical user interface based on cartographic modeling and visualization tools developed within the Synthetic Environments (SE) community. The site model used for model-based VSAM processing and visualization is represented using the Compact Terrain Database (CTDB). Targets are inserted as dynamic agents within the site model and viewed by Distributed Interactive Simulation clients such as the Modular Semi-Automated Forces (ModSAF) program and the associated 3D immersive ModStealth viewer. We have already demonstrated proof-of-concept of this idea at the Dismounted Battle Space Battle Lab (DBBL) Simulation Center at Fort Benning, Georgia as part of the April 1998 VSAM workshop. On April 13, researchers from CMU set up a portable VSAM system at the Benning Military Operations in Urban Terrain (MOUT) training site. The camera was set up at the corner of a building roof whose geodetic coordinates had been measured by a previous survey [GGB, 1996], and the height of the camera above that known location was measured. The camera was mounted on a pan-tilt head, which in turn was mounted on a leveled tripod, thus fixing the roll and tilt angles of the pan-tilt-sensor assembly to be zero. The yaw angle (horizontal orientation) of the sensor assembly was measured by sighting through a digital compass. After processing several troop exercises, log files containing camera calibration information and target hypothesis data packets were sent by FTP back to CMU and processed using the CTDB to determine a time-stamped list of moving targets and their geolocations. Later in the week, this information was brought back to the DBBL Simulation Center at Benning where, with the assistance of colleagues from BDM, it was played back for VSAM workshop attendees using custom software that broadcast time-sequenced simulated entity packets to the network for display by both ModSAF and ModStealth. Some processed VSAM video data and screen dumps of the resulting synthetic environment playbacks are shown in Figure 20.

5.2 3D Immersive VSAM

The Fort Benning experiment demonstrated that it is possible to automatically detect and track multiple people and vehicles from a VSAM system and insert them as dynamic actors in a synthetic environment for after-action review. We are proposing that, with some modifications, this process can also form the basis for a real-time immersive visualization tool. First, we are currently porting object geolocation computation using the CTDB onto the VSAM sensor processing unit platforms. Estimates of target geolocation will be computed within the frame-to-frame tracking process and will be transmitted in data packets back to the operator control workstation. Secondly, at the operator workstation, incoming object identity and geolocation data packets will be repackaged on the fly into Distributed Interactive Simulation (DIS) packets that are understood by ModSAF and ModStealth clients. At that point, targets will be viewable within the context of the site model by the 3D ModStealth viewer.

A real-time immersive visualization system would allow us to explore several experimental human-machine interface questions, the most important of which is: can the user get a feel for what activity is taking place purely by viewing events within the synthetic environment? If not, what additional information, such as infrequently updated images or very low-resolution video, is necessary to complete the picture? Can the operator effectively task troops to intercept a moving target on the basis of what he sees in the synthetic environment? Can the 3D synthetic environment be used to interactively control a multi-sensor VSAM system? (For example, the operator could be placed at the location of a sensor and allowed to rotate their viewpoint within the simulated environment, thereby teleoperating the camera pan-tilt unit to point in the same direction.) What might be a good approach to specifying 3D regions of interest while immersed within a simulated environment? We expect to answer some of these questions within our third-year research program.

6 VSAM Demonstrations

6.1 Demo I : Bushy Run

VSAM Demo I was held at CMU's Bushy Run research facility on November 12, 1997, roughly nine months into the program. The VSAM testbed system consisted of an OCU with two ground-based and one airborne SPU. The demo successfully highlighted the following technical achievements:

• Modeling. A site model was generated using a USGS orthophoto in combination with a high-resolution DEM generated using an airborne laser range-finder. At the time of the demo, an aerial scene mosaic was generated from the airborne sensor and geo-registered with the orthophoto using a projective warp to hand-selected tie points (a sketch of such a warp estimation is given below).

Figure 20: Sample synthetic environment visualizations of data collected at the Benning MOUT site. A) Automated tracking of three human targets. B) ModSAF 2D orthographic map display of estimated geolocations. C) Tracking of a soldier walking out of town. D) Immersive, texture-mapped 3D visualization of the same event, seen from a user-specified viewpoint.
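The projective warp used for this geo-registration can be recovered from a handful of tie points with the standard direct linear transform (DLT). The sketch below is a generic illustration under our own naming, not the code used in the demo; it solves for the 3x3 homography mapping mosaic coordinates onto orthophoto coordinates, given at least four hand-selected correspondences.

import numpy as np

def homography_from_tie_points(src_pts, dst_pts):
    # Direct linear transform: each correspondence (x, y) -> (u, v) contributes
    # two rows to a homogeneous system A h = 0; the solution is the right
    # singular vector with the smallest singular value. Requires >= 4 points.
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_point(H, x, y):
    # Apply the projective warp to a single mosaic point.
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]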

• Ground SPUs. Two ground sensors cooperatively tracked a car as it entered the Bushy Run site, parked, and let out two occupants. The two pedestrians were detected and tracked as they walked around and then returned to their car. The system continued tracking the car as it commenced its journey around the site, handing off control between cameras as the car left the field of view of each sensor. All entities were detected and tracked using temporal differencing motion detection and correlation-based tracking. Targets were classified as "vehicle" or "human" using a simple image-based property (aspect ratio) in conjunction with a temporal consistency constraint. Target geolocation was accomplished by intersecting back-projected viewing rays with the DEM terrain model (sketched below). A synopsis of the vehicle trajectory is shown in Figure 21.
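Geolocation by ray-DEM intersection can be pictured as marching along the back-projected viewing ray until it first dips below the terrain surface. The code below is a simplified illustration under assumed conventions (a regular elevation grid indexed by east/north, nearest-cell lookup, fixed step length); it is not the testbed implementation, which is described in [Collins et al., 1998].

import numpy as np

def geolocate_by_ray_dem(camera_pos, ray_dir, dem, dem_origin, cell_size,
                         step=1.0, max_range=5000.0):
    # March along the viewing ray from the camera in fixed steps (meters) and
    # return the first sample whose height falls at or below the terrain
    # elevation stored in `dem` (rows indexed by north, columns by east).
    # `dem_origin` is the (east, north) coordinate of dem[0, 0].
    d = np.asarray(ray_dir, dtype=float)
    d = d / np.linalg.norm(d)
    p = np.asarray(camera_pos, dtype=float)
    t = 0.0
    while t < max_range:
        q = p + t * d
        col = int((q[0] - dem_origin[0]) / cell_size)
        row = int((q[1] - dem_origin[1]) / cell_size)
        if 0 <= row < dem.shape[0] and 0 <= col < dem.shape[1] and q[2] <= dem[row, col]:
            return q              # approximate target geolocation on the terrain
        t += step
    return None                   # ray never intersected the modeled terrain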

• Air SPU. The airborne platform showcased real-time, frame-to-frame affine motion estimation and image warping using the Sensar VFE-100 video processing system (precursor to the PVT-200). Specific capabilities demonstrated were: fixating on a designated ground point while compensating for airplane movement and vibration using electronic stabilization and active camera control; multi-tasking the airborne sensor to servo between multiple designated areas so that the human operator could monitor several regions simultaneously; and limited moving target detection in open areas using temporal differencing on electronically stabilized video frames. A sketch of the affine estimation step is given below.

Figure 21: Synopsis of the vehicle trajectory during the Bushy Run demo.
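The frame-to-frame affine model referred to above has six parameters and can be fit by linear least squares to tracked point correspondences. The sketch below is a generic illustration of that step under our own naming, not the Sensar VFE-100 implementation.

import numpy as np

def fit_affine(prev_pts, curr_pts):
    # Least-squares fit of a 2x3 affine motion model M such that
    # curr ~= M @ [x, y, 1] for each tracked point; needs >= 3 correspondences.
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    design = np.hstack([prev_pts, np.ones((prev_pts.shape[0], 1))])   # N x 3
    params, _, _, _ = np.linalg.lstsq(design, curr_pts, rcond=None)   # 3 x 2
    return params.T                                                   # 2 x 3

def apply_affine(M, x, y):
    # Warp a single image point with the estimated motion; electronic
    # stabilization applies the inverse of this warp to the whole incoming frame.
    return tuple(M @ np.array([x, y, 1.0]))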

• OCU. The operator control unit featured a 2D graphical user interface that displayed all sensor positions, fields of view, and target locations overlaid on the orthophoto, along with live microwave video feeds from all sensors (shown previously in Figure 5). The GUI also allowed a human operator to task the ground sensors by selecting regions of interest and locations for sensor hand-off, and to directly control sensor pan/tilt and various control algorithm parameters.

6.2 Demo II : CMU

Demo II will be held on the campus of CMU, a much more urban environment than either Bushy Run or the Benning MOUT site. Figure 22 shows the placement of sensors and the OCU. IFD sensor assets include five fixed-mount pan-tilt-zoom cameras and the Islander airborne sensor. These will be augmented with two FRE sensor packages provided by Lehigh-Columbia (ParaCamera) and TI (video alarm system). These sensors will all cooperate to track a vehicle from nearby Schenley Park onto campus, follow its path as it winds through campus and stops at the OCU building, alert the operator when the vehicle's occupants attempt to break into the building, and follow the ensuing car and foot chases as the vehicle and its occupants attempt to flee from the police. Some preliminary moving object detection, tracking and geolocation results from the VSAM Demo II testbed system were shown in Figure 9. The testbed system and current video understanding technologies were presented in Sections 2 and 3.

Figure 22: Overview of sensor placement and OCU location for VSAM Demo II, to be held on the campus of CMU on October 8, 1998.

6.3 Demo III : CMU

The VSAM IFD effort has been funded for a third year. Demo III will also be held on the CMU campus, to take advantage of the VSAM infrastructure now in place. This section outlines planned goals for the third-year effort.

System Architecture: A second OCU with additional cameras will be added to the far side of campus to begin exploring issues in distributed VSAM. One or two thermal sensors will be installed to provide continuous day/night surveillance and to gain experience in target classification and recognition from thermal imagery.

Sensor Control: We will add more complex ground sensor control strategies, such as sensor multi-tasking and the ability to perform unsupervised monitoring of limited domains such as parking lots.

Video Understanding: We will develop improved capabilities for doing VSAM while in motion, including tracking targets while the sensor is panning and zooming, and performing real-time spherical mosaic background subtraction. Improved representation and classification of targets based on edge gradients, surface markings, and internal motion will be explored, as well as analysis of target behaviors using gait analysis, crowd flow analysis, and detection of human-vehicle interactions. We will also use domain knowledge of road networks to predict trajectories of vehicles.
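As a rough illustration of the planned spherical-mosaic background subtraction, the sketch below looks up a stored background patch for the current (quantized) pan/tilt setting and thresholds the difference against the live frame. The mosaic data structure, quantization, and threshold are our own assumptions, not a description of the eventual system.

import numpy as np

def quantize_view(pan_deg, tilt_deg, bin_deg=5.0):
    # Key the background mosaic by coarsely quantized pan/tilt angles.
    return (round(pan_deg / bin_deg) * bin_deg, round(tilt_deg / bin_deg) * bin_deg)

def motion_mask(frame, pan_deg, tilt_deg, mosaic, threshold=25.0):
    # `mosaic` maps quantized (pan, tilt) keys to stored background images of
    # the same size as `frame`. Pixels that differ from the stored background
    # by more than `threshold` gray levels are flagged as moving.
    key = quantize_view(pan_deg, tilt_deg)
    background = mosaic.get(key)
    if background is None:
        # No background stored for this view yet: remember it, report no motion.
        mosaic[key] = frame.astype(float)
        return np.zeros(frame.shape, dtype=bool)
    return np.abs(frame.astype(float) - background) > threshold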

User Interaction: Year 3 will see the CTDB campus site model fully integrated into the VSAM testbed system. We will also pursue the goal of adding 3D immersive graphical user interfaces for operator visualization and sensor tasking to the operator control workstation. Finally, year 3 will see the advent of Web-VSAM: remote site monitoring and SPU control over the internet using a Java DIS client. Potential sites being evaluated for a VSAM web-cam are the CMU campus and the Benning MOUT site.

Acknowledgments

The authors would like to thank the U.S. Army Night Vision and Electronic Sensors Directorate Lab team at Davison Airfield, Ft. Belvoir, Virginia for their help with the airborne operations. We would also like to thank Chris Kearns and Andrew Fowles for their assistance at the Fort Benning MOUT site, and Steve Haes and Joe Findley at BDM/TEC for their help with the CTDB site model and distributed simulation visualization software.

References

[Adelson and Bergen, 1991] E.H. Adelson and J.R. Bergen. The plenoptic function and the elements of early vision. In Michael Landy and J. Anthony Movshon, editors, Computational Models of Visual Processing, chapter 1. The MIT Press, Cambridge, MA, 1991.

[Anderson et al., 1985] C. Anderson, Peter Burt, and G. van der Wal. Change detection and tracking using pyramid transformation techniques. In Proceedings of SPIE - Intelligent Robots and Computer Vision, volume 579, pages 72-78, 1985.

[ASP, 1980] American Society of Photogrammetry (ASP). Manual of Photogrammetry, Fourth Edition. American Society of Photogrammetry, Falls Church, 1980.

[Barron et al., 1994] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):42-77, 1994.

[Bergen et al., 1992] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Proceedings of the European Conference on Computer Vision, 1992.

[Bradshaw et al., 1997] K. Bradshaw, I. Reid, and D. Murray. The active recovery of 3D motion trajectories and their use in prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3):219-234, March 1997.

[Bregler, 1997] C. Bregler. Learning and recognizing human dynamics in video sequences. In Proceedings of IEEE CVPR 97, pages 568-574, 1997.

[Collins et al., 1998] Robert Collins, Yanghai Tsin, J. Ryan Miller, and Alan Lipton. Using a DEM to Determine Geospatial Object Trajectories. Technical Report CMU-RI-TR-98-19, Carnegie Mellon University, 1998.

[Flinchbaugh and Bannon, 1994] B. Flinchbaugh and T. Bannon. Autonomous scene monitoring system. In Proc. 10th Annual Joint Government-Industry Security Technology Symposium. American Defense Preparedness Association, June 1994.

[Fujiyoshi and Lipton, 1998] Hironobu Fujiyoshi and Alan Lipton. Real-time human motion analysis by image skeletonization. In Proceedings of IEEE WACV98, 1998.

[GGB, 1996] Geometric Geodesy Branch (GGB). Geodetic Survey. Publication SMWD3-96-022, Phase II, Interim Terrain Data, Fort Benning, Georgia, May 1996.

[Grimson and Viola, 1997] Eric Grimson and Paul Viola. A forest of sensors. In Proceedings of DARPA VSAM Workshop II, November 1997.

[Hansen et al., 1994] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt. Real-time scene stabilization and mosaic construction. In Proc. Workshop on Applications of Computer Vision, 1994.

[Haritaoglu et al., 1998] I. Haritaoglu, Larry S. Davis, and D. Harwood. W4: Who? When? Where? What? A real-time system for detecting and tracking people. In FGR98 (submitted), 1998.

[Isard and Blake, 1996] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proceedings of European Conference on Computer Vision 96, pages 343-356, 1996.

[IST, 1994] Institute for Simulation & Training (IST). Standard for Distributed Interactive Simulation - Application Protocols, Version 2.0. University of Central Florida, Division of Sponsored Research, March 1994.

[Ju et al., 1996] S. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In Proceedings of International Conference on Face and Gesture Analysis, 1996.

[Kanade et al., 1997] Takeo Kanade, Robert Collins, Alan Lipton, P. Anandan, and Peter Burt. Cooperative multisensor video surveillance. In Proceedings of DARPA Image Understanding Workshop, volume 1, pages 3-10, May 1997.

[Koller et al., 1993] D. Koller, K. Daniilidis, and H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. International Journal of Computer Vision, 10(3):257-281, June 1993.

[Lipton et al., 1998] Alan Lipton, Hironobu Fujiyoshi, and Raju S. Patil. Moving target detection and classification from real-time video. In Proceedings of IEEE WACV98, 1998.

[Miller and Amidi, 1998] Ryan Miller and Omead Amidi. 3-D site mapping with the CMU autonomous helicopter. In Proc. 5th International Conference on Intelligent Autonomous Systems, Sapporo, Japan, June 1998.

[Oren et al., 1997] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of IEEE CVPR 97, pages 193-199, 1997.

[Sawhney and Kumar, 1997] H. S. Sawhney and R. Kumar. True multi-image alignment and its application to mosaicing and lens distortion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[Sawhney et al., 1998] H. S. Sawhney, S. Hsu, and R. Kumar. Robust video mosaicing through topology inference and local to global alignment. In Proc. European Conference on Computer Vision, 1998.

[Tomasi and Kanade, 1992] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams: a factorization method. International Journal of Computer Vision, 9(2), 1992.

[Tsai et al., 1994] P. Tsai, M. Shah, K. Ketter, and T. Kasparis. Cyclic motion detection for motion based recognition. Pattern Recognition, 27(12):1591-1603, 1994.

[Wixson and Selinger, 1998] L. Wixson and A. Selinger. Classifying moving objects as rigid or non-rigid. In Proc. DARPA Image Understanding Workshop, 1998.

[Wixson et al., 1998] L. Wixson, J. Eledath, M. Hansen, R. Mandelbaum, and D. Mishra. Image alignment for precise camera fixation and aim. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.

[Wren et al., 1997] C. Wren, A. Azarbayejani, T. Darrell, and Alex Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.