Distributed Video Data Fusion, Analysis, and Mining for Video Surveillance Applications*
Edward Chang2 and Yuan-Fang Wang1
Department of Electrical and Computer Engineering2
Department of Computer Science1
University of California, Santa Barbara, CA 93106
*Supported in part by NSF Career, ITR, IDM, and Infrastructure grants, and a gift from Proximex Corp.
Problem Statement
Video surveillance with
- Multiple cameras
- Mobile, wireless networks
- Online data processing
- Intelligent, computer-assisted content analysis
Focus of current work
- Event sensing: detection, representation, and recognition of motion events
- Sensor network management: bandwidth and power resource conservation
Potential Applications and Needs
Applications
- Emergency search and rescue in natural disasters
- Deterrence of cross-border illegal activities
- Reconnaissance and intelligence gathering in digital battlefields
Needs
- Rapid deployment, dynamic configuration, and continuous operation
- Robust and real-time data fusion and analysis
- Intelligent event modeling and recognition
[Figure: multi-camera geometry. Each camera i has a local frame (x_i, y_i, z_i); the shared world frame is (X, Y, Z). A world trajectory P(t) = (X(t), Y(t), Z(t))^T projects into camera 1 as the image trajectory p_1(t) = (x_1(t), y_1(t))^T. Slave stations report to a master station over the Internet.]
Validation Scenario

Research and Development Framework
- Event detection: far-field coordination and update; near-field sensor data fusion
- Event representation: hierarchical (multiple levels of detail); invariant (insensitive to incidental changes)
- Event recognition: temporally correlated event signatures; imbalanced training sets
Event Detection: Near-Field Sensor Data Fusion
Sensing coordination and intelligent data fusion via a two-level hierarchy of Kalman filters
- Bottom level (feed-forward): summarize trajectories in local state vectors; merge state vectors from multiple cameras through registration parameters
- Top level (feed-backward): fill in missing or occluded trajectory pieces; control camera pose and frame rate
[Figure: fusion architecture. Slave station i summarizes its local trajectory in a state vector x^(i) = (p^(i), ṗ^(i), p̈^(i))^T estimated from its measurements z^(i)(t); the master fusion station maintains the global state X = (P, Ṗ, P̈)^T. Registration transforms relate the two: X = T^(i)_{real→world} x^(i) and x^(i) = T^(i)_{world→real} X, for i = 0, …, m−1. Slave stations communicate with the master fusion station over the Internet.]
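The two-level fusion above can be sketched with a standard linear Kalman filter: a slave station summarizes its image-plane track in a local state vector, and the master lifts that state into world coordinates through a registration transform. This is a minimal illustration, not the system's implementation: the constant-velocity model, the noise levels, and the rotation used as T_real→world are all assumed here for demonstration.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-update cycle of a linear Kalman filter."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# Constant-velocity model in 2D: state = (px, py, vx, vy)
dt = 1.0
F = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])
H = np.hstack([np.eye(2), np.zeros((2, 2))])   # observe position only
Q = 0.01 * np.eye(4)
R = 0.1 * np.eye(2)

# Slave station: summarize a local trajectory into a state vector
x, P = np.zeros(4), np.eye(4)
for t in range(10):
    z = np.array([t * 1.0, 2.0 * t])    # synthetic image-plane track
    x, P = kalman_step(x, P, z, F, H, Q, R)

# Master station: lift the local state into world coordinates with a
# (hypothetical) registration transform T_real->world, here a rotation
T = np.array([[0.0, -1.0], [1.0, 0.0]])        # 90-degree rotation
X_world = np.block([[T, np.zeros((2, 2))],
                    [np.zeros((2, 2)), T]]) @ x
```

After ten clean measurements the filter has recovered position near (9, 18) and velocity near (1, 2); the top-level feedback described on the slide would run the same machinery in reverse to fill occluded trajectory pieces.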
Event Detection: Far-Field Coordination and Update
Minimizing bandwidth and power consumption under pre-specified accuracy constraints
- Dual Kalman filters: updates are necessary only when predictions diverge
- Cache dynamic algorithms instead of static data
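The dual-predictor idea can be sketched as follows: slave and master run the same predictor, and the slave transmits an update only when the real measurement diverges from the shared prediction beyond the accuracy constraint. The track values, tolerance, and constant-velocity predictor are illustrative assumptions.

```python
def predict(x, v, dt=1.0):
    """Shared constant-velocity predictor run at both stations."""
    return x + v * dt

eps = 0.5                     # pre-specified accuracy constraint
x_slave, v = 0.0, 1.0         # agreed initial state and velocity
sent = 0
track = [1.0, 2.0, 3.0, 4.1, 5.9, 7.8, 9.7]   # target speeds up mid-way
for z in track:
    pred = predict(x_slave, v)
    if abs(z - pred) > eps:   # prediction diverged: transmit an update
        x_slave = z
        sent += 1
    else:                     # within tolerance: no transmission needed
        x_slave = pred
    # the master applies the identical rule, so it stays in sync for free
print(sent, "updates sent for", len(track), "frames")
```

While the target obeys the cached model, nothing is transmitted; only the three frames after the speed change trigger updates, which is where caching the dynamic model (rather than streaming static position data) saves bandwidth.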
Event Representation
- Hierarchical: multiple levels of description (syntactic level, semantic level)
- Invariant: descriptors unaffected by incidental changes of environmental factors and camera pose
Consequences
- Able to perform both "intra-class" and "inter-class" recognition
- Recognizes syntactic similarity (the same trajectory) and semantic similarity (the same type of trajectory)
Event Representation: Syntactic Level
Normalization against
- Viewpoint (affine or perspective)
- Speed
to derive an invariant signature
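Speed normalization can be sketched by resampling a trajectory at equal arc-length steps, so that two traversals of the same path at different speeds yield the same signature. This is a minimal illustration of the idea; the slide's viewpoint (affine/perspective) normalization would be a further step.

```python
import numpy as np

def arclength_resample(traj, n=32):
    """Resample a 2-D trajectory at equal arc-length steps, removing
    the influence of traversal speed on the signature."""
    traj = np.asarray(traj, dtype=float)
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])     # cumulative length
    s_new = np.linspace(0.0, s[-1], n)
    x = np.interp(s_new, s, traj[:, 0])
    y = np.interp(s_new, s, traj[:, 1])
    return np.stack([x, y], axis=1)

# Same semicircular path traversed at two speeds -> same signature
t_slow = np.linspace(0, 1, 100)
t_fast = np.linspace(0, 1, 100) ** 2      # accelerating traversal
path = lambda t: np.stack([np.cos(np.pi * t), np.sin(np.pi * t)], axis=1)
sig_a = arclength_resample(path(t_slow))
sig_b = arclength_resample(path(t_fast))
```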
Event Representation: Semantic Level
- Segmentation based on acceleration
- Segment characterization
- Markov chain representation
[Figure: decision tree for segment characterization. Starting from each segment, yes/no tests on speed (V = 0?, constant?), on P × V (its sign, magnitude, and z-component), and on d|r|/dt classify the motion as stopped, constant velocity, quick start, quick accelerate, slow down, emergency stop, left/right turn, left/right half turn, or left/right inward/outward turn or spiral, each with accelerating or decelerating variants.]
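A coarse version of the characterization tests can be sketched with discrete derivatives: mean speed separates "stopped" from moving, and the sign of the velocity-acceleration cross product (a signed-curvature estimate) separates left turns, right turns, and straight motion. The thresholds and the reduced label set are illustrative assumptions, not the full tree above.

```python
import numpy as np

def characterize(seg, v_eps=1e-3, k_eps=0.1):
    """Label one trajectory segment with a coarse motion primitive,
    mirroring the tree's tests on speed and the sign of v x a."""
    seg = np.asarray(seg, dtype=float)
    v = np.diff(seg, axis=0)                  # velocity samples
    a = np.diff(v, axis=0)                    # acceleration samples
    speed = np.linalg.norm(v, axis=1).mean()
    if speed < v_eps:
        return "stopped"
    cross_z = np.mean(v[:-1, 0] * a[:, 1] - v[:-1, 1] * a[:, 0])
    kappa = cross_z / speed ** 3              # signed curvature estimate
    if abs(kappa) < k_eps:
        return "constant velocity"
    return "left turn" if kappa > 0 else "right turn"

t = np.linspace(0, np.pi / 2, 50)
ccw = np.stack([np.cos(t), np.sin(t)], axis=1)     # counter-clockwise arc
cw = np.stack([np.cos(-t), np.sin(-t)], axis=1)    # clockwise arc
line = np.stack([t, t], axis=1)                    # straight segment
```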
Event Representation: Semantic Level (cont.)
[Figure: Markov chain over segment labels, e.g., constant velocity, speed up, slow down, left half turn (w. acc/decel), and left inward/outward spiral (w. acc/decel).]
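Such a Markov chain can be estimated from a labeled segment sequence by counting consecutive-pair transitions and row-normalizing; the state list and label sequence below are illustrative, not the paper's data.

```python
import numpy as np

def transition_matrix(labels, states):
    """Estimate a Markov chain over segment labels by counting
    consecutive-pair transitions and row-normalizing."""
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(labels, labels[1:]):
        counts[idx[a], idx[b]] += 1
    rows = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions stay all-zero
    return np.divide(counts, rows, out=np.zeros_like(counts),
                     where=rows > 0)

states = ["constant velocity", "speed up", "slow down", "left half turn"]
labels = ["constant velocity", "speed up", "constant velocity",
          "left half turn", "constant velocity", "speed up"]
P = transition_matrix(labels, states)
```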
Event Recognition: Sequence Data Learning
- Similarity measurement is difficult: sequence data with temporal correlation may not have a vector space representation
- However, kernel methods (e.g., SVM) are still applicable: no vector space representation is needed, only a feature space representation
- Use a DP algorithm as the feature-space distance metric
- Use hierarchical kernel recognition and fusion
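The DP distance can be sketched as dynamic time warping (DTW) between sequences of feature vectors, plugged into an exponential similarity. Both the use of DTW as the specific DP algorithm and the kernel form below are assumptions for illustration; note that a DTW-based "kernel" is not positive semi-definite in general.

```python
import numpy as np

def dtw(a, b):
    """Dynamic-programming alignment distance between two sequences
    of feature vectors (rows); tolerant to differing lengths/speeds."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def seq_kernel(a, b, sigma=1.0):
    """A (hypothetical) sequence similarity built on the DTW metric."""
    return np.exp(-dtw(a, b) / sigma)

x = np.array([[0.0], [1.0], [2.0], [3.0]])
x_slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [3.0]])
y = np.array([[3.0], [2.0], [1.0], [0.0]])
```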
Event Recognition: Imbalanced Data Set
- Negative samples significantly outnumber positive samples
- The Bayesian risk associated with a false negative significantly outweighs that of a false positive
- Adaptive conformal mapping at the decision boundary
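The authors' remedy is adaptive conformal mapping of the kernel at the decision boundary; as a simpler stand-in, the asymmetric-cost idea can be sketched by choosing the decision threshold that minimizes empirical Bayesian risk when a false negative costs many times a false positive. The score distributions and cost ratio are synthetic.

```python
import numpy as np

def risk_threshold(scores_neg, scores_pos, c_fn=10.0, c_fp=1.0):
    """Pick the decision threshold minimizing empirical Bayesian risk
    when a false negative costs c_fn times a false positive."""
    cand = np.sort(np.concatenate([scores_neg, scores_pos]))
    best_t, best_r = cand[0], np.inf
    for t in cand:
        fn = np.sum(scores_pos < t)     # positives rejected
        fp = np.sum(scores_neg >= t)    # negatives accepted
        r = c_fn * fn + c_fp * fp
        if r < best_r:
            best_t, best_r = t, r
    return best_t

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 1000)        # abundant negatives
pos = rng.normal(2.0, 1.0, 20)          # rare positives
t_sym = risk_threshold(neg, pos, c_fn=1.0)
t_asym = risk_threshold(neg, pos, c_fn=10.0)
```

Raising the false-negative cost pulls the threshold toward the negative class, trading extra false alarms for fewer missed events, which is the trade-off the slide describes.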
Event Recognition: Statistical Modeling
- HMMs are expensive to build
- Not all behaviors are structured (e.g., loitering)
- It may not be necessary to understand individual activities before recognizing interactions
- Distinguish interaction patterns: following, following-and-gaining, stalking
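One hypothetical feature for such interaction patterns, not taken from the paper, is the time lag at which one trajectory best matches another: a consistent positive lag with small residual suggests "following" (and a shrinking gap would suggest "following-and-gaining").

```python
import numpy as np

def follow_lag(a, b, max_lag=10):
    """Find the delay at which trajectory b best matches trajectory a;
    a consistent lag with small residual suggests 'following'."""
    best_lag, best_err = 0, np.inf
    for lag in range(1, max_lag + 1):
        # Compare a at time t against b at time t + lag
        err = np.mean(np.linalg.norm(a[:-lag] - b[lag:], axis=1))
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag, best_err

t = np.arange(40, dtype=float)
leader = np.stack([t, 0.5 * t], axis=1)
follower = leader - 5 * np.array([1.0, 0.5])   # same path, 5 steps behind
lag, err = follow_lag(leader, follower)
```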
Experimental Results: Syntactic Matching
Experimental Results: Semantic Indexing
Experimental Results: Biased Learning
[Plot: sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) as functions of the decision threshold and penalty.]
Experimental Results: Statistical Learning
Results
Relevant Publications
Many details are omitted:
- Sensor registration (spatial and temporal)
- Object tracking (Kalman and multi-state)
- Power management and routing
1. L. Jiao, G. Wu, Y. Wu, E. Y. Chang, and Y. F. Wang, "The Anatomy of a Multi-Camera Video Surveillance System," to appear in the ACM Multimedia Systems Journal.
2. K. Wu, J. Long, D. Han, and Y. F. Wang, "Human Activity Detection and Recognition for Video Surveillance," Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 2004.
3. E. Y. Chang and Y. F. Wang, "Toward Building a Robust and Intelligent Video Surveillance System: A Case Study" (invited paper), Proceedings of the IEEE Multimedia and Expo Conference, Taipei, Taiwan, 2004.
4. R. Rangaswami, Z. Dimitrijevic, K. Kakligian, E. Y. Chang, and Y. F. Wang, "The SfinX Video Surveillance System," Proceedings of the IEEE Multimedia and Expo Conference, Taipei, Taiwan, 2004.
5. G. Wu, Y. Wu, L. Jiao, Y. F. Wang, and E. Y. Chang, "Multi-camera Spatio-temporal Fusion and Biased Sequence-data Learning for Security Surveillance," Proceedings of the ACM Multimedia Conference, Berkeley, CA, 2003.
6. K. Wu, J. Long, D. Han, and Y. F. Wang, "Real-Time Multi-person Tracking in Video Surveillance," Proceedings of the Pacific Rim Multimedia Conference, Singapore, 2003.
7. Y. Wu, L. Jiao, G. Wu, E. Y. Chang, and Y. F. Wang, "Invariant Feature Extraction and Biased Statistical Inference for Video Surveillance," Proceedings of the IEEE International Conference on Advanced Video and Signal-based Surveillance, Miami, FL, 2003.
Focus of This Seminar
- Video-based face tracking, modeling, and recognition
- Human activity and interaction analysis

Video-Based Face Tracking & Recognition
- Image-based: image normalization, feature selection, face recognition
- Video-based: face region detection, tracking, face modeling and recognition

Difficulties
- Video quality is low: large illumination and pose variation, occlusion
- Face images are small compared with those in still image-based systems
- Model construction and fitting: generic vs. person-specific, 2D vs. 3D
Proposed Approach: Resolution Enhancement
- Exploit multiple image frames and spatial coherency
- Single-camera super-resolution (digital zoom)
- Multi-camera (master-slave) face region detection and zooming (optical zoom): needs feature appearance (PCA + LDA) and geometrical relations
General Framework: Visual Servoing
A feedback control mechanism in which reference and real signals are computed from images
[Figure: control loop. Reference signal − real signal → error signal → J⁻¹ → control signal → camera control (with external disturbance) → new image → feature detection → real signal.]
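The loop above can be sketched as proportional control through the pseudo-inverse of an image Jacobian: the image-space error between reference and detected feature is mapped into a camera command, and the new image shrinks the error. The 2×2 Jacobian, gain, and simulated camera response are illustrative assumptions.

```python
import numpy as np

def servo_step(feature, target, J, gain=0.5):
    """One loop iteration: the image-space error (reference - real
    signal) is mapped through the inverse image Jacobian into a
    camera control signal."""
    error = target - feature
    return gain * np.linalg.pinv(J) @ error

# Hypothetical image Jacobian: camera pan/tilt to pixel motion
J = np.array([[40.0, 0.0], [0.0, 30.0]])
feature = np.array([100.0, 80.0])    # detected feature (pixels)
target = np.array([160.0, 120.0])    # reference signal: image center
for _ in range(20):
    u = servo_step(feature, target, J)
    feature = feature + J @ u         # simulated camera response
```

With gain 0.5 the pixel error halves every iteration, so after 20 iterations the feature has converged onto the reference point.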
Master-Slave Combo Setup
[Figure: master and slave camera frames (X, Y, Z each) relative to the world frame. A world point P maps into the master image as p_master = T_{world→master} P; chaining the master-to-world and world-to-slave registration transforms gives p_slave = T_{world→slave}(f_slave, z_slave, …) T_{master→world} p_master, where the slave's transform depends on its focal length f_slave and the target depth z_slave.]
Master: Anatomy-Guided Face Modeling
- Face region localization based on anatomy
- Face region detection based on skin-color segmentation
- Face region modeling based on ellipse fitting
- Face region tracking using a mean-shift tracker
[Figure: master camera frame (X, Y, Z) and transform T_{master→world}.]
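The ellipse-fitting step can be sketched from image moments: given a binary skin mask, the region's centroid and second-order moments yield an ellipse's center, axis lengths, and orientation. The synthetic mask below stands in for an actual skin-color segmentation.

```python
import numpy as np

def ellipse_from_mask(mask):
    """Fit an ellipse to a binary region via its second-order image
    moments: returns centroid, axis lengths, and orientation."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs - cx, ys - cy]))
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]          # major axis first
    evals, evecs = evals[order], evecs[:, order]
    axes = 2.0 * np.sqrt(evals)              # semi-axis estimates (x2)
    angle = np.arctan2(evecs[1, 0], evecs[0, 0])
    return (cx, cy), axes, angle

# Synthetic "skin" mask: an axis-aligned elliptical face region
h, w = 120, 160
yy, xx = np.mgrid[0:h, 0:w]
mask = ((xx - 80) / 30.0) ** 2 + ((yy - 60) / 40.0) ** 2 <= 1.0
center, axes, angle = ellipse_from_mask(mask)
```

For a uniform ellipse the variance along a semi-axis a is a²/4, so doubling the square-root eigenvalues recovers the semi-axes (here 40 and 30 pixels, major axis vertical).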
Slave: Master-Guided Zooming
[Figure: slave camera frame guided by the master: the transforms T_{master→world} and T_{world→slave}(f_slave, z_slave, …) steer the slave toward the face region detected by the master.]
What's Next?
- View-based recognition: frontal-view detection, multi-frame evidence aggregation, 3D model (?)
Single-Camera Super-Resolution
Multiple, spatially coherent frames are treated as down-sampled, low-resolution (LR) images of an original high-resolution (HR) image.
Mathematically:
I^(k) = D B T^(k) I + n^(k)

where I^(k) = [I^(k)_11 … I^(k)_1n; …; I^(k)_m1 … I^(k)_mn] is the k-th observed m×n LR frame (stacked into a vector), I is the cm×cn HR image, and n^(k) is additive noise.
Three components:
- Spatial registration function (T)
- Blurring function (B)
- Down-sampling function (D)
c: down-sampling factor
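The forward model I^(k) = D B T^(k) I can be sketched by composing the three components; for illustration, T is an integer translation, B a 3×3 box filter standing in for the Gaussian PSF, and D a block-averaging downsampler with factor c.

```python
import numpy as np

def shift(img, dy, dx):
    """Spatial registration T: integer translation with zero fill."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def blur(img):
    """Blurring B: 3x3 box filter (stand-in for the Gaussian PSF)."""
    pad = np.pad(img, 1, mode="edge")
    return sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def downsample(img, c):
    """Down-sampling D by factor c: average c x c blocks."""
    h, w = img.shape
    return img[:h - h % c, :w - w % c].reshape(
        h // c, c, -1, c).mean(axis=(1, 3))

# k-th LR frame: I_k = D B T_k I (+ noise n, omitted here)
rng = np.random.default_rng(1)
I_hr = rng.random((32, 32))
I_k = downsample(blur(shift(I_hr, 1, 2)), c=2)
```

Super-resolution inverts this chain: given several I_k with different registrations T_k, solve the resulting large linear system for the HR image I.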
Spatial Registration Function
- Modeled as an affine transform capturing translation, rotation, and zooming
- In reality, only translational motion has been successfully demonstrated

T^(k) = [ a_x  b_x  c_x ]
        [ a_y  b_y  c_y ]
Blurring Function
Modeled as a Gaussian kernel. Caveats:
- The point spread (blurring) function may not be known, and it is wavelength-dependent
- Diffraction induces ripples and is better modeled with Bessel functions
Numerical Solution
- Large system of equations, requiring preconditioning
- Not certain to work in the real world
- Simpler mechanisms (e.g., bilinear interpolation) exist, with inferior performance
- Optical zoom instead of digital zoom
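For comparison, the simpler bilinear-interpolation baseline can be sketched in a few lines; it upsamples a single frame and cannot recover detail beyond it, which is why the multi-frame formulation is worth the larger system of equations.

```python
import numpy as np

def bilinear_upsample(img, c):
    """Bilinear interpolation: the simple single-frame baseline that
    multi-frame super-resolution is meant to beat."""
    h, w = img.shape
    ys = np.linspace(0.0, h - 1, h * c)
    xs = np.linspace(0.0, w - 1, w * c)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    # Blend the four neighboring pixels of each output sample
    top = img[np.ix_(y0, x0)] * (1 - fx) + img[np.ix_(y0, x1)] * fx
    bot = img[np.ix_(y1, x0)] * (1 - fx) + img[np.ix_(y1, x1)] * fx
    return top * (1 - fy) + bot * fy

ramp = np.add.outer(np.arange(4.0), np.arange(4.0))   # linear test image
up = bilinear_upsample(ramp, 2)
```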
Schedule
- 9/29: overview
- 10/6: Dan: face recognition overview
- 10/13: no meeting (research travel)
- 10/20: Dr. Kang
- 10/27, 11/3, 11/10, 11/17, 11/24:
- Video-based face modeling and recognition
- Super-resolution: multiple images, space-time
- Human activity/interaction analysis