
To appear in IEEE Intelligent Systems, Sept/Oct 2005

Distributed Interactive Video Arrays for Event Capture and Enhanced Situational Awareness

Mohan M. Trivedi, Tarak L. Gandhi, and Kohsia S. Huang

Computer Vision and Robotics Research (CVRR) Laboratory

University of California, San Diego http://cvrr.ucsd.edu/

Abstract

Computer vision promises to play a significant role in a wide range of Homeland Security applications. The objective is to apply computer vision techniques and algorithms under various environmental conditions for security, surveillance, and protection of physical infrastructures against human and vehicular threats. In this paper we provide an overview of a multi-camera video surveillance approach called the Distributed Interactive Video Array (DIVA) system. It provides a large-scale, redundant cluster of video streams to observe a remote scene and to supply automatic focus-of-attention with event-driven servoing to capture desired events at appropriate resolutions and perspectives. This paper describes the design and deployment of DIVA-based systems for vehicle tracking and reidentification, perimeter monitoring, and bridge structure monitoring, as well as for people tracking, face detection and recognition, and activity analysis. Deployment of DIVA modules at Superbowl XXXVII demonstrates the practical utility and promise of multi-camera arrays for Homeland Security needs.

Keywords: Computer vision, tracking, multi-camera systems, surveillance, visualization.

1 The Role of Computer Vision in Homeland Security

Computer vision has been recognized as one of the key research areas in Artificial Intelligence since the field's very beginning. From the seventies through the nineties, computer vision proved its practical value in a wide range of application domains, including medical diagnostics, semiconductor manufacturing, automatic target recognition and smart weapons, remote sensing, and various environmental applications.

It was therefore no surprise that when the very first set of requests for proposals (RFPs) from the Combating Terrorism Technology Office of the Technical Support Working Group (TSWG), managed by the U.S. Secretary of Defense, was announced on October 23, 2001 (about five weeks after the September 11th attacks), several computer-vision-related research topics were prominently listed among the new research thrust areas. Research projects were solicited with the specific objective of rapid prototypes developed in less than two years [1]. New concepts and systems were urgently sought for the diverse needs listed below:


a) Remote monitoring of real-time or near-real-time movements of forces and resources: networked autonomous systems that provide a fused picture of the environment and the movements.

b) Locating faces in video images containing one or more human faces, with special interest in “natural environments” with unconstrained lighting and pose angles.

c) Identifying faces in video images under unconstrained lighting and pose conditions with potential for “real-time” applications.

d) Systems for tracking a single person through multiple sequential video images or through multiple cameras in uncontrolled lighting environments.

e) Terrorist behavior and action prediction technology to assist the analyst in identifying patterns, trends, and models of behavior of terrorist groups and individuals, including visualization and display tools for understanding the relationships among persons, events, and patterns of behavior.

f) Physical security support to protect personnel, equipment, and facilities against terrorist activities.

The newly established Department of Homeland Security (DHS) also recognized the importance of the computer vision field when one of its very first RFPs, issued in April 2004, was titled "Automated Scene Understanding". Many other government agencies, including the National Research Council, encouraged realignment of research agendas and programs to support homeland security applications [2][3]. For example, the National Science Foundation sponsored a number of workshops to identify and encourage research in the cyber-infrastructure and sensor networks fields [4]. Computer vision was once again identified as an important topic, and the NSF report highlighted the need to develop "ubiquitous vision" with networked and cooperative arrays of cameras.

2 Video Surveillance: Research Overview and Motivation for Multi-camera Systems

The application domain of computer vision most relevant to this paper is video surveillance. Predictably, this area has witnessed a dramatic increase in activity in the past three years. A snapshot of the latest research trends and contributions can be derived by examining the papers appearing in recently established international conferences [5][6] and special issues of major journals on video surveillance [7]. These papers show a migration of research interest from simple static-image-based analysis to video-based dynamic monitoring and analysis of scene content. The technical community has started making important contributions to difficult problems associated with invariance to illumination, background, color, and perspective [5][6]; tracking and analysis of deformable shapes associated with moving human bodies as well as moving cameras [7]; and activity analysis and control of multi-camera systems [6][7]. The earlier work on surveillance mostly dealt with single stationary cameras, whereas the recent trend is towards active multi-camera systems. Such systems offer several advantages over single-camera systems: multiple overlapping views to obtain 3D information and handle occlusions, multiple non-overlapping cameras for wide-area coverage, and active pan-tilt-zoom cameras to zoom in on object details. The research discussed in this paper deals with such a distributed array of cameras, which allows wide-area monitoring and scene analysis at multiple levels of abstraction. Installing multiple sensors introduces several new research issues related to system design, including handoff schemes for passing tracked objects between sensors and clusters, methods for determining the "best view" given the context of the scene, and sensor fusion algorithms to best employ the strengths of a given sensor or sensor modality.

Our team has been engaged in a number of research projects dealing with some of the topics listed above. A key element of our activities is the utilization of arrays of video cameras, distributed over a wide area, which can provide multiple levels of semantically meaningful information ("situational awareness") to match the needs of multiple remote observers. DIVA provides a large-scale, redundant cluster of video streams to observe a remote scene and to supply automatic focus-of-attention with event-driven servoing to capture desired events at appropriate resolutions and perspectives. Key Homeland Security applications where computer vision plays a critical role are highlighted in Figure 1, along with the algorithmic and operational requirements. The Distributed Interactive Video Array (DIVA) is a framework we propose to effectively support the development and deployment of powerful vision and visualization systems. In this paper, we provide an overview of the DIVA framework; describe experimental results of DIVA for observing roads, bridges, and perimeters; and present DIVA-based person tracking, face capture, and gesture analysis modules, as well as an integrated situational awareness system. Selected modules of DIVA have been successfully deployed and tested at Superbowl XXXVII, the Coronado Bridge, and the roadways of San Diego for almost four years, proving the value of computer vision techniques for Homeland Security.


Figure 1: Computer vision for homeland security: application drivers, algorithms, operational conditions, and DIVA-based implementation framework.

Systems similar to those discussed in this paper promise to provide solutions to a number of specific threats encountered in protecting critical infrastructures, national landmarks, and public spaces. For instance, in the case of a natural or man-made disaster, a DIVA-type system can provide exact visual and seismic damage assessment. If a protected site such as an airport runway, port facility, or military base is breached at an unauthorized location, a DIVA-based system can pinpoint the exact point of breach, capture close-up video of the event, start tracking the vehicle or persons responsible for the breach, and alert the authorities to take proper action.

3 Distributed Interactive Video Arrays: Framework and Functionalities

As discussed in the previous section, while single-perspective-camera systems have served as the most common video capture mechanism, dependence on a single view severely limits the quantity and quality of data available from the viewable environment. Many systems also use a single dedicated processor to analyze and record data, and do not provide the ability to distribute processing, select from an array of available sensors, or access real-time or archived data at multiple remote sites. In contrast, we propose the distributed interactive video array, which supports the following capabilities (Figure 2):

a) Distributed Video Networks: To allow complete coverage, the sensors must be placed over a wide area. The system has televiewing capability, i.e., all sources of information are available through a TCP/IP connection to the distributed computer(s).

b) Active Camera Systems: Exploitation of redundant sensing is mandatory. For this reason, the framework must have one or more central "monitors" to select the camera with the best view of a given area in response to an event. Focus-of-attention in multi-camera systems is a relevant and relatively new research area.

c) Multiple Object Tracking and Handoff: To create a model of the environment and interact with it, the objects in the scene must be detected, segmented, and tracked, not only in each view but also across different views. This problem is usually referred to as the "camera handoff" problem or the "reidentification" problem.

d) 3-D Localization: Once an object has been detected, tracked in different views, and reidentified, the system should be able to determine where it is in 3-D world coordinates. Effective 3-D coordination of cameras in a multi-camera system is still a challenging research topic.

e) Multisensor Integration: How to exploit information from rectilinear CCD cameras, omnidirectional cameras, and infrared cameras in an integrated and effective way is one of the key objectives of the system.


Figure 2: “Active” video capture and analysis for multi-level situational awareness using DIVA.

To realize this fusion for integrated situational awareness, we have also developed the Networked Sensor Tapestry (NeST) framework for multi-level semantic integration, shown in Figure 3 [8]. The NeST server takes inputs from the indoor and outdoor visual analysis modules of the preceding levels over the network and timestamps them; meanwhile, it archives these data in a database. Privacy of tracked persons is assured in NeST by a set of programmable plug-in privacy filters operating on incoming sensor data, which prevent access to, or transform data to remove, personally identifiable information. These privacy filters are developed and specified using a privacy grammar that can connect multiple low-level data filters and features to create arbitrary data-dependent privacy definitions [8]. A visualization tool, the Context Visualization Environment (CoVE), provides users with a 3D virtual-reality interface to ongoing activities. CoVE also allows users to replay previous recordings of the surveillance spaces for investigative purposes.
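As a concrete illustration of the plug-in filter idea, the sketch below chains two simple filters over an incoming track record. All names (TrackRecord, drop_identity, apply_filters) are hypothetical stand-ins, not the actual NeST API or privacy grammar.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class TrackRecord:
    """One timestamped observation arriving at the server (hypothetical)."""
    timestamp: float
    position: tuple                 # (x, y, z) in world coordinates
    identity: Optional[str]         # recognized name, or None
    face_chip: Optional[bytes]      # cropped face image, or None

Filter = Callable[[TrackRecord], TrackRecord]

def drop_identity(rec: TrackRecord) -> TrackRecord:
    rec.identity = None             # strip the personally identifiable label
    return rec

def drop_face_chip(rec: TrackRecord) -> TrackRecord:
    rec.face_chip = None            # prevent storage of face imagery
    return rec

def apply_filters(rec: TrackRecord, filters: List[Filter]) -> TrackRecord:
    for f in filters:               # filters compose left to right
        rec = f(rec)
    return rec

# A policy connects several low-level filters into one privacy definition:
policy = [drop_identity, drop_face_chip]
rec = TrackRecord(1234.5, (1.0, 2.0, 0.0), "alice", b"...")
print(apply_filters(rec, policy).identity)  # -> None
```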

Figure 3: The Networked Sensor Tapestry (NeST) architecture for semantic integration.



4 DIVA for Observing Roads, Bridges, and Perimeters

Transportation infrastructure is recognized as a critical Homeland Security asset [3] that must be protected from terrorist attacks, natural disasters, and the continuous degradation caused by heavy traffic and the elements. Bridges are critical in such infrastructure, and monitoring them requires both seismic sensors and cameras. Such a multimodal sensory system characterizes important patterns associated with structural movements as well as dynamic loads due to vehicular traffic over the bridges. In case of security incidents, it is also useful to identify and track the same vehicle in a number of cameras spread over a wide area, which involves the issues of camera handoff and reidentification discussed below.

4.1 Multi-Camera Vehicle Tracking, Handoff, and Reidentification

To extract moving vehicles from a video sequence, it is necessary to identify locations where changes occur in the video scene. A commonly used and computationally inexpensive method of accomplishing this is background subtraction [9], where a background image is generated using several frames of video and subtracted from the current video image to separate moving foreground objects. The resulting image is processed to extract blobs, and the blobs satisfying size, area, and density constraints are identified as vehicles. To robustly track vehicles over multiple frames, the existing tracks are associated with appropriate blobs. Measured blob positions are combined with track parameters using a Kalman filter [9] to improve accuracy. New tracks are generated from unassociated blobs, and tracks that remain unassociated for a certain number of frames are removed.
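The following minimal sketch illustrates the detection stage using OpenCV's MOG2 background model as a stand-in for the paper's background-maintenance scheme; the size, area, and density thresholds are illustrative, and the resulting blob centroids would feed the per-track Kalman filters.

```python
import cv2
import numpy as np

# MOG2 stands in for the paper's background model; thresholds are illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

def detect_vehicles(frame, min_area=400, max_area=50000, min_density=0.4):
    fg = subtractor.apply(frame)                             # foreground mask
    fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow labels
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg)
    blobs = []
    for i in range(1, n):                                    # label 0 is background
        x, y, w, h, area = stats[i]
        density = area / float(w * h)                        # filled fraction of box
        if min_area <= area <= max_area and density >= min_density:
            # centroid feeds the per-track Kalman filter for data association
            blobs.append((tuple(centroids[i]), (x, y, w, h)))
    return blobs
```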

To perform seamless tracking over multiple cameras, object identity should be maintained over the entire track history. When the fields of view (FOV) of the cameras partially overlap, we have a handoff problem (similar to handoff in a cellular network), where objects leaving one camera can be immediately transferred to the other camera. Figure 4 shows an example of camera handoff. When an object touches any point on the dotted line in one camera, the corresponding point is checked to locate the object in the other camera, and the track is passed from the first camera to the second.
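A minimal sketch of this handoff rule is shown below, assuming a pre-computed ground-plane homography between the two views; the track data layout and the gating threshold are illustrative, not the system's actual implementation.

```python
import numpy as np

H_1_to_2 = np.eye(3)  # placeholder calibration: camera-1 -> camera-2 homography

def map_point(H, pt):
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]                       # dehomogenize

def try_handoff(track, boundary_y, tracks_cam2, gate=25.0):
    """Transfer `track` to camera 2 once it crosses the boundary line."""
    if track["pos"][1] < boundary_y:
        return None                           # has not reached the line yet
    expected = map_point(H_1_to_2, track["pos"])
    # pick the camera-2 track nearest the predicted entry point
    best = min(tracks_cam2,
               key=lambda t: np.linalg.norm(np.array(t["pos"]) - expected),
               default=None)
    if best is not None and np.linalg.norm(np.array(best["pos"]) - expected) < gate:
        best["label"] = track["label"]        # identity follows the object
        return best
    return None
```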


Figure 4: Vehicle tracking handoff between cameras with overlapping fields of view. Correspondence (1→B) is established when track 1 in Camera 1 crosses the dotted line common to both cameras.

When the camera FOVs are non-overlapping and physically separated by a large distance, we have the 'reidentification problem'. This problem is more difficult than camera handoff, since an object in one camera could have a number of potential matches in the other camera, and it may not always be possible to disambiguate all the matches. Huang and Russell [10] provide a probabilistic framework for the vehicle reidentification problem. We use the color and size of vehicles and the time of transit between the cameras to match vehicles across cameras. Note that since the proportion of colored vehicles and large-sized vehicles is small, we select only vehicles having sufficiently high color saturation, or a size larger than a threshold, in order to avoid false matches. The algorithm for vehicle reidentification is as follows (a sketch of the matching stage appears after the list):

• Vehicle Detection: Detect vehicles in the upstream and downstream cameras using background subtraction and extract their snapshots.
• Feature Extraction: For all vehicles in both cameras:
  o Use K-means clustering to group the colors of vehicle pixels, and select the color whose cluster has the largest number of pixels.
  o Compute the second moments of the vehicle pixels and estimate vehicle size as the length of the major axis of the moments.
  o Select vehicles having high enough color saturation or large size.
• Matching: For each vehicle in the upstream camera:
  o Select vehicles in the downstream camera that arrived within a window based on the expected transit time.
  o Compute the weighted distance between the upstream vehicle and all selected downstream vehicles.
  o Assign a confidence score based on the weighted distance to each match.
  o Select matches with confidence score greater than a threshold, which is adjusted to detect as many matches as possible while keeping the number of false matches at a tolerable level.
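The sketch below illustrates the matching stage under assumed feature weights, transit-time window, and confidence mapping; none of these are the tuned values used in the deployed system.

```python
import numpy as np

EXPECTED_TRANSIT = 120.0  # assumed seconds between the two cameras
WINDOW = 30.0             # tolerated deviation from the expected transit time

def weighted_distance(up, down, w=(1.0, 1.0, 0.5)):
    """Distance between feature dicts with keys 'hue', 'size', 't'."""
    d_color = abs(up["hue"] - down["hue"])
    d_size = abs(np.log(up["size"]) - np.log(down["size"]))
    d_time = abs(down["t"] - up["t"] - EXPECTED_TRANSIT)
    return w[0] * d_color + w[1] * d_size + w[2] * d_time

def match(upstream, downstream, conf_threshold=0.5):
    matches = []
    for u in upstream:
        # candidates arriving within the transit-time window
        cands = [d for d in downstream
                 if abs(d["t"] - u["t"] - EXPECTED_TRANSIT) < WINDOW]
        if not cands:
            continue
        best = min(cands, key=lambda d: weighted_distance(u, d))
        conf = 1.0 / (1.0 + weighted_distance(u, best))  # map distance to (0, 1]
        if conf > conf_threshold:
            matches.append((u, best, conf))
    return matches
```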

Figure 5 shows experiments on vehicle reidentification using a pair of San Diego Traffic Management Center cameras on the Coronado Bridge.

Figure 5: Vehicle reidentification using videos from San Diego Traffic Management Center cameras on the Coronado Bridge; the two cameras are approximately 2 miles apart. Vehicles are detected in both cameras; features such as color, size, and time of arrival are extracted; and features are matched using a weighted distance with confidence measures (green: 20-40%, yellow: 40-75%, red: 75-100%). (Courtesy: US Geological Survey aerial image, http://terraserver-usa.com).


4.2 Sensor Fusion for Infrastructure Health Monitoring

A multimodal sensory system containing cameras as well as seismic sensors characterizes important patterns associated with structural movements as well as dynamic loads due to vehicular traffic over bridges. In this section we discuss how DIVA modules can be incorporated into such a multimodal surveillance and event-capture application. A large number of civil structures have been instrumented with various types of sensors for monitoring their structural health. Seismic sensors such as strain gauges and accelerometers can provide temporal signatures of vehicles passing over them, which can be used to estimate, with good accuracy, the weight of the vehicles and their effect on the structure. However, seismic sensors are also sensitive to other natural and artificial phenomena, such as earthquakes, blasts, and external vibrations. Video sensors can be useful for distinguishing these phenomena from normal vehicular traffic. Video sensors can also give rich information about the shape, size, color, velocity, and track history of vehicles. The information provided by the seismic and video sensors can be combined to improve the reliability of the overall system. Figure 6 shows the block diagram of a bridge structural-health monitoring system under development. Input from multiple cameras as well as seismic sensors can be obtained. Video streams are processed by the vision module to detect and track vehicles and extract their image properties. These can be used in conjunction with the responses from the seismic sensors to help determine the effect of various types of vehicle loads on the bridge.

Figure 6: Block diagram for civil infrastructure monitoring.


Figure 7: Civil infrastructure monitoring with camera videos and vibration sensors under the passage. Vehicles are classified into buses (red +) and cars (blue ×) with features from both sensors using Linear Discriminant Analysis (LDA).

Figure 7 shows an application for the detection of vehicles and the extraction of their properties, including vehicle snapshots from video processing and the responses from the strain gauges for both directions of traffic. The following properties were recorded and used for classifying the vehicles into buses and cars: (1) the larger of the two peak responses (corresponding to each wheel base) recorded by the strain gauge when a vehicle is detected, (2) the time interval between the two peak responses, (3) the vehicle blob area obtained from image-based detection, and (4) the aspect ratio (height/width) of the vehicle blob. Each of these properties is larger for buses and smaller for cars. These features are combined using Fisher's Linear Discriminant Analysis (LDA) [11] to find an optimal linear combination of the logarithms (for scale invariance) of these properties that maximizes the variation between the classes while minimizing the variation within each class. The boundary between the two classes is a hyperplane obtained by thresholding this linear discriminant using a Bayesian error criterion [11] to minimize the number of classification errors.
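A minimal sketch of this two-class Fisher LDA step is given below, operating on the logarithms of the four features; the training arrays and the threshold are placeholders for recorded bus/car data.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher discriminant direction for two classes.

    X1, X2: (n_samples, 4) arrays of log-features per class."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrix
    Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) \
       + np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(Sw, m1 - m2)   # direction maximizing class separation
    return w / np.linalg.norm(w)

def classify(features, w, threshold):
    """Project the log-features and threshold them (a Bayesian criterion
    would choose `threshold` to minimize training error)."""
    return "bus" if np.log(features) @ w > threshold else "car"
```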

4.3 Perimeter Sentry with Active Camera Control

For continuous monitoring of wide areas, it is not always practical to have a person look at the video all the time to identify suspicious activities. It is helpful to have a system for automatically extracting and summarizing such interesting events. An important application of such a system is a perimeter sentry, which guards a pre-configurable monitoring zone or “virtual fence.” Background subtraction [9] is used to detect moving objects such as persons and vehicles which are tracked over frames. Any such track that breaches the virtual fence, such as a person passing in that zone, triggers an alarm that could notify relevant persons. In addition, such an event can also initiate active control of other cameras in the array. For example, using the location of the monitoring zone, a pan-tilt-zoom camera can be made to point to the zone and get finer details of the event.
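A minimal sketch of the fence trigger follows, using a point-in-polygon test on the tracked position; the zone coordinates and the breach callback (which would raise the alarm and cue a PTZ camera) are illustrative.

```python
import numpy as np
import cv2

# Protected zone as a polygon in image coordinates (illustrative values).
FENCE = np.array([[100, 300], [500, 300], [500, 450], [100, 450]],
                 dtype=np.int32).reshape(-1, 1, 2)

def check_fence(track_pos, on_breach):
    """Fire `on_breach` when a tracked position lies inside the zone."""
    inside = cv2.pointPolygonTest(FENCE, (float(track_pos[0]),
                                          float(track_pos[1])), False) >= 0
    if inside:
        on_breach(track_pos)  # e.g., raise alarm and aim a PTZ camera at the zone
    return inside

check_fence((250, 380), lambda p: print("breach at", p))
```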

Figure 8 shows an application of the perimeter sentry. A car entering the protected zone at the garage triggers another camera, which zooms in towards the event and captures a high-resolution video sequence. A face detection module is then applied to the high-resolution images to capture the intruder's face. In another scenario, a stalled vehicle triggers another camera to zoom in on the location to observe details such as the vehicle's license plate.


Figure 8: Active camera control. A security zone can be defined for intrusion detection. Another camera is activated by the event in order to capture a close-up view.


4.4 DIVA Deployment in “Outdoor” Settings

The systems described above have been successfully deployed and tested on our campus and other platforms. Vehicle tracking and traffic parameter estimation have been performed with cameras overlooking the I-5 freeway passing through the campus. The perimeter sentry was demonstrated near the Coronado Bridge, where it successfully detected intrusions into the security zone. Multimodal vehicle data extraction with video and seismic sensors is deployed on a campus road with significant traffic, and we are working on deploying a similar system on a bridge over the I-5 freeway. Multi-camera handoff has been demonstrated with cameras on a campus street. Vehicle reidentification between distant cameras has been applied to videos obtained from Traffic Management Center cameras on the Coronado Bridge, and we are working on evaluating and improving the algorithm's performance.

In addition, several DIVA modules were successfully deployed at Superbowl XXXVII (Figure 9) [12]. A high-resolution thermal camera was mounted near the riverbed beside the stadium to detect humans and animals in visually cluttered scenes on a 24-hour basis. Traffic flow analysis was used to monitor the peripheral traffic on Friars Road. An omni-camera was installed in the downtown Gaslamp district to monitor traffic conditions using the digital tele-viewer (a software interface that unwarps omni video into perspective video at customizable pan-tilt-zoom settings) and simultaneously estimate crowd size. These surveillance nodes were remotely linked to and controlled by the perimeter sentry command center in Seaport Village, San Diego, serving the city authorities, police, and first responders.
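To illustrate the unwarping idea behind the digital tele-viewer, the sketch below re-samples an omnidirectional "donut" image into a panoramic strip using a simple polar model; a calibrated mirror profile would replace the linear radius mapping, and all parameters are illustrative.

```python
import numpy as np
import cv2

def unwarp_omni(omni, center, r_in, r_out, out_w=1440, out_h=240):
    """Re-sample an omni (donut) image into a panoramic strip."""
    cx, cy = center
    theta = np.linspace(0, 2 * np.pi, out_w, endpoint=False)
    radius = np.linspace(r_in, r_out, out_h)
    R, T = np.meshgrid(radius, theta, indexing="ij")  # (out_h, out_w) grids
    map_x = (cx + R * np.cos(T)).astype(np.float32)   # source x per output pixel
    map_y = (cy + R * np.sin(T)).astype(np.float32)   # source y per output pixel
    return cv2.remap(omni, map_x, map_y, cv2.INTER_LINEAR)

# A virtual "pan" is then just a horizontal crop of the panorama:
# view = panorama[:, pan_offset:pan_offset + view_width]
```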

Figure 9: DIVA security network deployment at Superbowl XXXVII in San Diego.



5 DIVA for Person Tracking, Event Capture, and Activity Analysis in Indoor Settings

In contrast to the outdoor applications, the indoor DIVA systems utilize multiple types of cameras with highly overlapping FOVs for versatile analysis of human-related events and activities. The objective of such systems is to develop sensor networks that derive multi-level awareness of human activity and identity. As shown in Figure 10, we propose a DIVA system for deriving such a multilevel semantic description of activities in a room. It includes a video analysis level that processes camera array videos for person segmentation. With both omni and rectilinear pan-tilt-zoom (PTZ) video arrays, the system can obtain coarse-to-fine awareness of human activities [13]. Next, the localization level detects people and tracks them continuously. Human faces are also captured with the PTZ array [13]. Then, the gesture analysis and identification levels derive higher semantic details of human gesture and identity. The integration and visualization level derives the spatial-temporal co-occurrence of events for high-level activity awareness, in order to focus limited resources on particular humans. This level also archives and visualizes events in real time and replays them for investigative purposes [8].


Figure 10: An architecture of the indoor DIVA system with multi-level visual context abstraction.


5.1 Real-Time 3D Person Tracking, Face Detection and Recognition

For real-time indoor 3D person tracking [13], as indicated in Figure 10, the omni array videos first undergo pixel-level processing that segments human silhouettes by background subtraction with shadow elimination. The horizontal locations and heights of people are measured from the silhouettes by triangulation with the calibrated omni video array [13]. These 3D measurements of humans are then associated with the existing tracks, and track initialization and termination are decided according to time constraints. Finally, the Kalman track filters are updated with the new measurements to output the estimated and predicted track locations.
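The core of the localization step can be illustrated by intersecting bearing rays from two calibrated omni cameras, as in the sketch below; the camera poses are illustrative, and the actual system fuses measurements from the full array through Kalman filters.

```python
import numpy as np

def triangulate(cam1_pos, theta1, cam2_pos, theta2):
    """Intersect two 2D bearing rays (angles in radians, world frame)."""
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # solve cam1 + t*d1 == cam2 + s*d2 for (t, s)
    A = np.column_stack([d1, -d2])
    b = np.asarray(cam2_pos, float) - np.asarray(cam1_pos, float)
    t, _ = np.linalg.solve(A, b)
    return np.asarray(cam1_pos, float) + t * d1

# Two cameras 6.7 m apart, each reporting a bearing to the same silhouette:
print(triangulate((0.0, 0.0), np.deg2rad(45), (6.7, 0.0), np.deg2rad(135)))
```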

Building on 3D person tracking, the system actively derives a focus of attention on the subjects in order to capture their faces, as shown in Figure 10. Knowing the head location of a subject via the 3D tracker, the nearest PTZ camera is chosen to latch onto the subject's face [13]. Next, as shown in Figure 11, skin-tone regions and elliptical edges along the face contour are detected in the image in order to find plausible faces. The face candidates are then verified with a face classifier and used to update the face tracks. With this real-time scheme, faces can be detected robustly under challenging environmental conditions.
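A minimal sketch of the camera-selection step follows: given a tracked head position, pick the nearest PTZ camera and compute the pan/tilt needed to aim at the face. The camera positions and naming are assumptions, not the deployed configuration.

```python
import numpy as np

# Hypothetical PTZ camera positions (x, y, z) in meters; not the real layout.
PTZ_CAMERAS = {"ptz1": np.array([0.0, 0.0, 2.5]),
               "ptz2": np.array([6.7, 6.6, 2.5])}

def aim_nearest_ptz(head_pos):
    """Pick the closest PTZ camera and compute pan/tilt toward the head."""
    head = np.asarray(head_pos, float)
    name, cam = min(PTZ_CAMERAS.items(),
                    key=lambda kv: np.linalg.norm(kv[1] - head))
    v = head - cam
    pan = np.degrees(np.arctan2(v[1], v[0]))                   # azimuth
    tilt = np.degrees(np.arctan2(v[2], np.hypot(v[0], v[1])))  # elevation
    return name, pan, tilt

print(aim_nearest_ptz((3.0, 2.0, 1.7)))  # -> camera id, pan, tilt
```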

Face orientation estimation is very useful for assessing the focus of attention and intent of the subject. As in Figure 11, the face video frames are first projected into a facial feature subspace, and the likelihood scores associating a face frame with various face orientation clusters are computed. These orientation likelihoods are then tracked across frames by a Hidden Markov Model (HMM), whose state sequence corresponds to the final face orientation sequence. For face recognition [13], clusters of different identities are trained in the feature subspace, and the identity likelihoods are accumulated across the frames of a video segment to make the final decision. These video-based face analysis algorithms experimentally surpass single-frame methods because they accumulate confidence over time.
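The accumulation step for face recognition can be sketched as summing per-frame log-likelihoods per identity over a video segment and taking the maximum, as below; the likelihood values are placeholders for scores computed in the feature subspace.

```python
import numpy as np

def decide_identity(frame_likelihoods, names):
    """frame_likelihoods: (n_frames, n_identities) per-frame likelihoods."""
    log_l = np.log(np.asarray(frame_likelihoods) + 1e-12)
    totals = log_l.sum(axis=0)       # accumulate evidence across frames
    return names[int(np.argmax(totals))]

# Three frames of illustrative likelihoods for two enrolled identities:
per_frame = [[0.6, 0.4], [0.7, 0.3], [0.4, 0.6]]
print(decide_identity(per_frame, ["alice", "bob"]))  # -> "alice"
```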


Figure 11: Flow chart for video-based face capture, pose estimation, and recognition.



Besides face analysis, human activities are also captured by 3D human body gesture analysis. As shown in Figure 10, voxels of human subjects are reconstructed from the omni video array. A cylindrical 3D shape context descriptor is then scaled to each subject in order to capture body configurations. The dynamics of the 3D body configurations, or gestures, are modeled by a vocabulary of HMMs. Given a gesture sequence, the 3D shape context histograms are vector-quantized, and the index sequence is scored against the HMM vocabulary to decide the final gesture by maximum likelihood. With this scheme, gesture recognition can be performed robustly despite noisy and low-resolution human body voxelization.
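A minimal sketch of this decision pipeline follows: shape-context histograms are vector-quantized against a codebook, and the resulting symbol sequence is scored with a discrete-HMM forward pass per gesture; the codebook and HMM parameters are placeholders for trained models.

```python
import numpy as np

def quantize(histograms, codebook):
    """Map each shape-context histogram to its nearest codeword index."""
    d = np.linalg.norm(histograms[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def log_forward(symbols, pi, A, B):
    """Log-likelihood of a symbol sequence under a discrete HMM
    (pi: initial probs, A: transition matrix, B: per-state emission probs)."""
    alpha = np.log(pi) + np.log(B[:, symbols[0]])
    for s in symbols[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + np.log(A), axis=0) \
                + np.log(B[:, s])
    return np.logaddexp.reduce(alpha)

def recognize(histograms, codebook, hmms):
    """hmms: dict gesture -> (pi, A, B); return the best-scoring gesture."""
    symbols = quantize(np.asarray(histograms), np.asarray(codebook))
    scores = {g: log_forward(symbols, *params) for g, params in hmms.items()}
    return max(scores, key=scores.get)
```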

5.2 DIVA Deployment in Indoor Settings for Integrated Situational Awareness

In this section we demonstrate integrated experiments on indoor situational awareness. The real-time 3D tracker is deployed in a 6.7m × 6.6m room with four omni cameras, each capturing 640 × 480 video. Tracking accuracy of ~20 cm for five simultaneous people is obtained with this deployment [13]. To register the entrance or exit of a person, access zones of the room are defined and displayed in the CoVE interface, as shown in Figure 12a. Track data are sent to the NeST server to monitor the zones and are archived over long periods of time. Through this integration, passing counts for the zones are accumulated along with the track indices.

As shown in the dialog box in Figure 10, a PTZ camera is driven by the 3D tracker to capture the face of a person upon entrance. The face is detected at ~15 fps and identified by the system [13]. The face image is also attached to the human bounding box in CoVE, as shown in Figure 12b. A long-term face archive is shown in Figure 12b. The 3D tracker monitors the room continuously and archives entering people automatically with timestamps, which is suitable for visual surveillance and forensic support applications. In an attentive scenario, multiple people are scanned sequentially by the closest PTZ cameras. When a person enters or exits, the system resets the scanning order.

Currently, the gesture recognition and the video-based face orientation and recognition modules that involve HMMs are implemented in Matlab. Although they run offline, they give very promising accuracies [13]. System situational awareness will be further enhanced once their C++ implementations are available.


Figure 12: (a) Long-term zone watch; the zone count changes as a human track passes through it. (b) Automatic tracking-based face capture and archive over ~50 minutes; in multi-person cases, the subjects are captured in turn.

6 Concluding Remarks

Computer vision is recognized to play a significant role in enhancing personal safety and protecting infrastructure and property within national borders. Remote monitoring of transportation facilities and public spaces, as well as automatic notification systems triggered by potentially dangerous events, can certainly incorporate vision systems as essential components. It should be recognized that such applications pose major challenges to existing and commercially available systems, mainly due to strict requirements: very high detection rates, almost zero false alarm rates, robustness to environmental variations, distributed and almost ubiquitous coverage, and real-time or near-real-time performance. In this paper we presented a multi-camera video surveillance approach, the Distributed Interactive Video Array (DIVA) system. It provides a large-scale, redundant cluster of video streams to observe a remote scene and to supply automatic focus-of-attention with event-driven servoing to capture desired events at appropriate resolutions and perspectives. The DIVA system and its modules have been deployed in a number of real-world settings over the years, including Superbowl XXXVII, the Coronado Bridge, San Diego roadways, and the UCSD campus. These deployments, and their use by the first-responder community, show the promise of computer vision systems for Homeland Security applications.

Acknowledgements

The authors wish to thank the reviewers for their constructive comments. We are thankful to our research sponsors, primary among them the Technical Support Working Group (TSWG) of the U.S. Department of Defense, the NSF Information Technology Research (ITR) Grant for Structural Health Monitoring, and the NSF ITR Rescue Project. UC Discovery Grants supported the development of various experimental research testbeds.

References

[1] USD(AT&L)/TSWG, Broad Agency Announcement 02-Q-4665, Oct. 2001.
[2] "Making the Nation Safer: The Role of Science and Technology in Countering Terrorism," National Research Council Report, US National Academies Press, 2002.
[3] H. Chen, F. Wang, and D. Zeng, "Intelligence and Security Informatics for Homeland Security: Information, Communication, and Transportation," IEEE Trans. on Intelligent Transportation Systems, vol. 5, no. 4, pp. 329-341, Dec. 2004.
[4] S. Mehrotra et al., "Project Rescue: Challenges in Responding to the Unexpected," Proc. SPIE, vol. 5304, pp. 179-192, Jan. 2004.
[5] Proc. ACM 2nd Int'l Wksp. on Video Surveillance & Sensor Networks, Oct. 2004.
[6] Proc. IEEE Conf. on Advanced Video and Signal Based Surveillance, Jul. 2003.
[7] Special Issue on Visual Surveillance, Multimedia Systems, vol. 10, no. 2, pp. 116-180, Nov. 2004.
[8] D. Fidaleo, R. E. Schumacher, and M. M. Trivedi, "Visual Contextualization and Activity Monitoring for Networked Telepresence," Proc. ACM 2nd Int'l Wksp. on Effective Telepresence, pp. 31-39, 2004.
[9] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach, Prentice-Hall, New Jersey, 2003.
[10] T. Huang and S. Russell, "Object Identification: A Bayesian Analysis with Application to Traffic Surveillance," Artificial Intelligence, vol. 103, pp. 1-17, 1998.
[11] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, Oct. 2000.
[12] D. Ramsey, "Researchers Work with Public Agencies to Enhance Super Bowl Security," http://www.calit2.net/news/2003/2-4_superbowl.html
[13] M. M. Trivedi, K. S. Huang, and I. Mikić, "Dynamic Context Capture and Distributed Video Arrays for Intelligent Spaces," IEEE Trans. on Systems, Man, and Cybernetics, Part A, vol. 35, no. 1, pp. 145-163, Jan. 2005.


Mohan Manubhai Trivedi is a Professor of Electrical and Computer Engineering and the founding Director of the Computer Vision and Robotics Research Laboratory at the University of California, San Diego. Trivedi has a broad range of research interests in intelligent systems, computer vision, intelligent ("smart") environments, intelligent vehicles and transportation systems, and human-machine interfaces. In close collaboration with regional transportation and first-responder agencies, Trivedi regularly participates in projects dealing with infrastructure protection and physical security. He is also active in research on privacy-preserving technologies, as well as in dialogs at various multidisciplinary forums on balancing privacy with security in the use of video technology. Trivedi has received the Distinguished Alumnus Award from Utah State University, and the Pioneer Award (Technical Activities) and Meritorious Service Award from the IEEE Computer Society.

Tarak Gandhi received his Bachelor of Technology degree in Computer Science and Engineering from the Indian Institute of Technology, Bombay. He earned his M.S. and Ph.D. in Computer Science and Engineering from the Pennsylvania State University, specializing in computer vision. He worked at Adept Technology, Inc. on designing algorithms for robotic systems. Currently, he is a postdoctoral scholar at the Computer Vision and Robotics Research Laboratory at the University of California, San Diego. His interests include computer vision, motion analysis, image processing, robotics, target detection, and pattern recognition. He is working on projects involving intelligent driver assistance, motion-based event detection, traffic flow analysis, and structural health monitoring of bridges.

Kohsia Samuel Huang (M’98-S’00-M’05) is a postdoctoral researcher at the Computer Vision and Robotics Research (CVRR) Laboratory, University of California, San Diego (UCSD). His research interests include multimodal intelligent environments, computer vision, machine learning, and signal processing. He received his Ph.D. degree in electrical engineering at UCSD in March 2005. He is a member of the IEEE.