372 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 3, APRIL 2008
Video-Based Human Movement Analysis and Its Application to Surveillance Systems
Jun-Wei Hsieh, Member, IEEE, Yung-Tai Hsu, Hong-Yuan Mark Liao, Senior Member, IEEE, and Chih-Chiang Chen
Abstract: This paper presents a novel posture classification system that analyzes human movements directly from video sequences. In the system, each sequence of movements is converted into a posture sequence. To better characterize a posture in a sequence, we triangulate it into triangular meshes, from which we extract two features: the skeleton feature and the centroid context feature. The first feature is used as a coarse representation of the subject, while the second is used to derive a finer description. We adopt a depth-first search (DFS) scheme to extract the skeletal features of a posture from the triangulation result. The proposed skeleton feature extraction scheme is more robust and efficient than conventional silhouette-based approaches. The skeletal features extracted in the first stage are used to extract the centroid context feature, which is a finer representation that can characterize the shape of a whole body or body parts. The two descriptors working together make human movement analysis a very efficient and accurate process because they generate a set of key postures from a movement sequence. The ordered key posture sequence is represented by a symbol string. Matching two arbitrary action sequences then becomes a symbol string matching problem. Our experiment results demonstrate that the proposed method is a robust, accurate, and powerful tool for human movement analysis.
Index Terms: Behavior analysis, event detection, string matching, triangulation.
I. INTRODUCTION
THE analysis of human body movements can be applied in a
variety of application domains, such as video surveillance,
video retrieval, human-computer interaction systems, and med-
ical diagnoses. In some cases, the results of such analysis can be
used to identify people acting suspiciously and other unusual
events directly from videos. Many approaches have been pro-
posed for video-based human movement analysis [1]–[6], [9],
[10]. For example, Oliver et al. [9] developed a visual surveil-
lance system that models and recognizes human behavior using
hidden Markov models (HMMs) and a trajectory feature. Ros-
ales and Sclaroff [10] proposed a trajectory-based recognition
Manuscript received July 12, 2006; revised August 4, 2007. This work wassupported in part by the National Science Council of Taiwan, R.O.C., underGrant NSC952221-E-155071 and by the Ministry of Economic Affairs underContract 94-EC-17-A-02-S1032. The associate editor coordinating the reviewof this manuscript and approving it for publication was Prof. Kiyoharu Aizawa.
J.-W. Hsieh, Y.-T. Hsu, and C.-C. Chen are with the Department of Elec-trical Engineering, Yuan Ze University, Chung-Li 320, Taiwan, R.O.C. (e-mail:shieh@saturn.yzu.edu.tw; s937114@mail.yzu.edu.tw; s929106@mail.yzu.edu.tw).
H.-Y. M. Liao is with the Institute of Information Science, Academia Sinica,Taipei, Taiwan, R.O.C. (e-mail: liao@iis.sinica.edu.tw).
Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2008.917403
system to detect pedestrians in outdoor environments and rec-
ognize their activities from multiple views based on a mixture of
Gaussian classifiers. Other approaches extract human postures
or body parts (such as the head, hands, torso, or feet) to ana-
lyze human behavior [5], [6]. Mikic et al. [5] used complex 3-D
models and multiple video cameras to extract 3-D information
for 3-D posture analysis, while Werghi [6] extracted 3-D infor-
mation and applied wavelet transforms to analyze 3-D human
postures. Although 3-D features are more informative than 2-D
features (like contours or silhouettes) for classifying human pos-tures, the inherent correspondence problem and high computa-
tional cost make this approach inappropriate for real-time ap-
plications. As a result, a number of researchers have based their
analyses of human behavior on 2-D postures. In [2], Cucchiara
et al. proposed a probabilistic posture classification scheme
to identify several types of movement, such as walking, run-
ning, squatting, or sitting. Ivanov and Bobick [11] presented a
2-D posture classification system that uses HMMs to recognize
human gestures and behaviors. In [13], Wren et al. proposed
a Pfinder system for tracking and recognizing human behavior
based on a 2-D blob model. The drawback of this approach
is that incorporating 2-D posture models into the analysis of human behavior raises the possibility of ambiguity between the
adopted models and real human behavior due to the mutual oc-
clusions between body parts and loose clothes. In [17] and [18],
Guo et al. used a stick model to represent the parts of a human
body as sticks and link the sticks with joints for human motion
tracking. Then, in [19], Ben-Arie et al. used a voting technique
to recover the parameters of the stick model from video sequences
for action analysis. In addition to the stick model, the cardboard
model [20] is widely recognized to be good for modeling ar-
ticulated human motions. However, the prerequisite that body
parts must be well segmented makes this model inappropriate
for real-time analysis of human behavior.
To solve the problem with the cardboard model, Park and Aggarwal [15] used a dynamic Bayesian network to segment
a body into different parts [16]. This blob-based approach is
very promising for analyzing human behavior at the semantic
level, but it is very sensitive to changes in illumination. In addi-
tion to blobs, several researchers have proposed classifying
human postures based on silhouettes. Weik and Liedtke [12]
traced the negative minimum curvatures along body contours
to segment body parts and then identified body postures using
a modified Iterative Closest Point (ICP) algorithm. Fujiyoshi et
al. [8] presented a skeleton-based method to recognize postures
by extracting skeletal features based on curvature changes along
the human silhouette. In addition, Chang and Huang [7] used
1520-9210/$25.00 © 2008 IEEE
Fig. 1. Flowchart of the proposed system.
different morphological operations to extract skeletal features
from postures and then identified movements using a HMM
framework. Although the contour-based approach is simple and
efficient for coarse classification of human postures, it is vul-
nerable to the effects of noise, imperfect contours, or occlu-
sions. Another approach used to analyze human behavior is the
Gaussian probabilistic model. In [2], Cucchiara et al. used a
probabilistic projection map to model postures and performed
frame-by-frame posture classification to recognize human be-
havior. The method uses the concept of a state-transition graph
to integrate temporal information about postures in order to deal
with the occlusion problem. However, the projection histogram
is not robust because it changes dramatically under different lighting conditions.
In this paper, we propose a novel triangulation-based system
that analyzes human behavior directly from videos. The detailed
components of the system are shown in Fig. 1. First, we apply
background subtraction to extract body postures from video se-
quences and then derive their boundaries by contour tracing.
Next, a Delaunay triangulation technique [14] is used to divide
a posture into different triangular meshes. From the triangula-
tion result, two important features, the skeleton and the cen-
troid context (CC), are extracted for posture recognition. The
skeleton feature is used for a coarse search, while the centroid
context, being a finer feature, is used to classify postures more accurately. To extract the skeleton feature, we propose a graph
search method that builds a spanning tree from the triangulation
result. The spanning tree corresponds to the skeletal structure
of the analyzed body posture. The proposed skeleton extrac-
tion method is much simpler than silhouette-based techniques
[8]–[13]. In addition to the skeletal structure, the tree provides important information for segmenting a posture into different
body parts. Based on the result of body part segmentation, we
construct a new posture descriptor, namely, the centroid con-
text descriptor, for recognizing postures. This descriptor uti-
lizes a polar labeling scheme to label every triangular mesh with
a unique number. Then, for each body part, a feature vector,
i.e., the centroid context, is constructed by recording all related features of the centroid of each triangular mesh according to
this unique number. We can then compare different postures
more accurately by measuring the distance between their cen-
troid contexts. Each posture is assigned a semantic symbol so
that each human movement can be converted and represented
by a set of symbols. Based on this representation, we use a new
string matching scheme to recognize different human move-
ments directly from videos. The string-based method modifies the calculation of the edit distance by using different weights to
measure the operations of insertion, deletion, and replacement.
Because of this modification, even if two movements involve large-scale changes, their edit distance is still small. The slow
increase in the edit distance substantially reduces the effect of the time warping problem. The experiment results demonstrate
Fig. 2. Sampling of control points. (a) A point with a high curvature. (b) Two points with high curvatures that are too close to each other.
the feasibility and superiority of the proposed approach for an-
alyzing human behavior.
The remainder of the paper is organized as follows. The De-
launay triangulation technique is introduced in Section II. In
Section III, we describe the posture classification scheme, which
is based on the skeletal feature. In Section IV, the centroid con-
text for posture classification is discussed. The key posture se-lection and string matching techniques are detailed in Section V.
The experiment results are given in Section VI. We then present
our conclusions in Section VII.
II. DEFORMABLE TRIANGULATIONS
In this paper, it is assumed that all video sequences are cap-
tured by a stationary camera. When the camera is static, the
background of the captured video sequence can be reconstructed
using a mixture of Gaussian models [24]. This allows us to de-
tect some samples and extract foreground objects by subtracting
the background. We then apply some simple morphological op-
erations to remove noise. After these pre-processing steps, we
partition a foreground human posture into triangular meshes
using the constrained Delaunay triangulation technique, and ex-
tract both the skeletal features and the centroid context directly
from the triangulation result [25].
Assume that is a posture extracted in binary form by
image subtraction. To triangulate , a set of control points
must be extracted along its contour in advance. Let be the
set of boundary points along the contour of . We extract some
high curvature points from as the set of control points. Let
be the angle of a point in . As shown in Fig. 2(a), the
angle can be determined by two specified points, and ,
selected from both sides of along to satisfy
(1)
where and are two thresholds set to and
, respectively. is the total length of the contour of .
With and , the angle can be determined by
(2)
If is less than a threshold (here we set it at 150°), is
selected as a control point. In addition to (2), it is required that
two control points must be far apart. This enforces the constraint
that the distance between any two control points must be larger than the threshold (defined in (1)). As shown in Fig. 2(b),
Fig. 3. All the vertices are labeled anticlockwise such that the interior of V is located on their left.
if two candidates, and , are close to each other, i.e.,
, the candidate with the smaller angle is chosen as
a control point.
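To make the sampling rule concrete, the selection of high-curvature control points can be sketched as follows. This is a hypothetical implementation: the offset `frac` stands in for the two thresholds of (1), whose exact symbols are elided in this copy, and the same offset is reused as the minimum spacing between control points; the angle test and the keep-the-sharper-candidate rule follow the description above.

```python
import math

def control_points(contour, frac=0.02, angle_thresh=150.0):
    """Select high-curvature control points along a closed contour.

    contour: list of (x, y) boundary points in order.
    frac: fraction of the contour length used as the offset to the two
          side points p- and p+ (an assumed stand-in for the thresholds
          of Eq. (1)) and as the minimum spacing between control points.
    """
    n = len(contour)
    off = max(1, int(frac * n))          # index offset approximating arc length
    cands = []
    for i, p in enumerate(contour):
        pm = contour[(i - off) % n]      # side point p- of p
        pp = contour[(i + off) % n]      # side point p+ of p
        v1 = (pm[0] - p[0], pm[1] - p[1])
        v2 = (pp[0] - p[0], pp[1] - p[1])
        n1 = math.hypot(*v1); n2 = math.hypot(*v2)
        if n1 == 0 or n2 == 0:
            continue
        cosang = max(-1.0, min(1.0, (v1[0]*v2[0] + v1[1]*v2[1]) / (n1 * n2)))
        ang = math.degrees(math.acos(cosang))
        if ang < angle_thresh:           # sharp enough to be a candidate
            cands.append((ang, i))
    # Enforce spacing: of any close pair, keep the smaller-angle candidate.
    cands.sort()                         # sharpest candidates first
    chosen = []
    for ang, i in cands:
        if all(min(abs(i - j) % n, n - abs(i - j) % n) >= off
               for _, j in chosen):
            chosen.append((ang, i))
    return sorted(i for _, i in chosen)
```

On a densely sampled square contour, only the four corner points survive: interior edge points see an angle of 180° and are never candidates.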
Assume is the set of control points extracted along the
boundary of . Each point on is labeled anticlockwise and
modulated by the size of . If any two adjacent points on
are connected by an edge, is deemed to be a planar graph,
i.e., a polygon. We adopt the constrained Delaunay triangula-
tion technique proposed by Chew [26] to divide into trian-
gular meshes.
As shown in Fig. 3, is the set of interior points of in. For a triangulation , is said to be a constrained
Delaunay triangulation of if a) each edge of is an edge of
and b) for each remaining edge of , there exists a circle
such that (1) the endpoints of are on the boundary of ,
and (2) if a vertex on is inside , then it cannot be seen from
at least one of the endpoints of . More precisely, given three
vertices , , and on , the triangle belongs
to the constrained Delaunay triangulation if and only if
(i) , where
;
(ii) , where is a circum-circle of
, , and . That is, the interior of does not have a vertex .
Based on the above definition, Chew [26] proposed a divide-
and-conquer algorithm to execute a constrained Delaunay tri-
angulation of in time. The algorithm works re-
cursively. When contains only three vertices, itself is the
triangulation result. However, when contains more than three
vertices, we choose an edge from and search for the corre-
sponding third vertex that satisfies properties i) and ii). Then, we
subdivide into two sub-polygons and . The same pro-
cedure is applied recursively to and until the processed
polygon contains only one triangle. In summary, the four steps
of the above algorithm are as follows:
1. Choose a starting edge .
2. Find the third vertex of that satisfies properties (i) and
(ii).
3. Subdivide into two sub-polygons:
and .
4. Repeat Steps 1–3 on and until the processed polygon
consists of only one triangle.
Details of this algorithm can be found in Bern and Eppstein
[23]. Fig. 4 shows an example of the triangulation result when the algorithm is applied to a human posture.
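The four steps above can be sketched as follows. This is a simplified illustration, not Chew's full algorithm: it assumes a convex input polygon and uses a plain empty-circumcircle test over the polygon vertices, omitting the constrained-edge visibility rule needed for non-convex boundaries.

```python
import math

def circumcircle(a, b, c):
    """Center and squared radius of the circle through a, b, c (or None)."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2 * (ax*(by - cy) + bx*(cy - ay) + cx*(ay - by))
    if d == 0:
        return None                       # collinear points: no circle
    ux = ((ax**2 + ay**2)*(by - cy) + (bx**2 + by**2)*(cy - ay)
          + (cx**2 + cy**2)*(ay - by)) / d
    uy = ((ax**2 + ay**2)*(cx - bx) + (bx**2 + by**2)*(ax - cx)
          + (cx**2 + cy**2)*(bx - ax)) / d
    return (ux, uy), (ax - ux)**2 + (ay - uy)**2

def triangulate(poly):
    """Recursive Delaunay-style triangulation of a convex polygon.

    Picks the edge between the first and last vertex (Step 1), finds a
    third vertex whose circumcircle contains no other polygon vertex
    (Step 2), then splits into two sub-polygons and recurses (Steps 3-4).
    """
    if len(poly) < 3:
        return []
    if len(poly) == 3:
        return [tuple(poly)]
    a, b = poly[0], poly[-1]              # Step 1: starting edge (a, b)
    for k in range(1, len(poly) - 1):     # Step 2: candidate third vertex
        c = poly[k]
        cc = circumcircle(a, b, c)
        if cc is None:
            continue
        (ux, uy), r2 = cc
        if all((p[0] - ux)**2 + (p[1] - uy)**2 >= r2 - 1e-9
               for i, p in enumerate(poly) if i not in (0, k, len(poly) - 1)):
            # Steps 3-4: subdivide along the chords (a, c) and (c, b).
            return ([(a, c, b)]
                    + triangulate(poly[:k + 1])
                    + triangulate(poly[k:]))
    return []                             # degenerate input
```

A convex polygon with n vertices always yields n - 2 triangles, e.g. two for a square and three for a convex pentagon.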
Fig. 4. Triangulation result of a body posture. (a) Input posture. (b) Triangulation result of (a).
Fig. 5. Flowchart of posture classification using skeletons.
III. SKELETON-BASED POSTURE RECOGNITION
In this work, we extract the skeleton and the centroid context
as features from the triangulation result. Conventionally, the ex-
tracted skeleton is based on the contour of an object. For example, in [11], Fujiyoshi et al. extract feature points with minimum curvatures from the silhouette of a human body to define
the skeleton. However, as the method is vulnerable to noise, we
use a graph search scheme to solve the problem.
A. Triangulation-Based Skeleton Extraction
In Section II, we presented a technique for triangulating a
human body image into triangular meshes. By connecting all the
centroids of any two connected meshes, a graph can be formed.
In this section, we use a depth-first search technique to find the
skeleton that will be used for posture recognition. Fig. 5 shows
the block diagram of posture classification using skeletons. In what follows, details of each block will be described.
Assume that is a binary posture. Using the above tech-
nique, is decomposed into a set of triangular meshes , i.e.,
. Each triangular mesh in has a
centroid , and two meshes, and , are connected if they
share one common edge. Then, based on this connectivity,
can be converted into an undirected graph , where all cen-
troids in are nodes on ; and an edge exists between
and if and are connected. The degree of a node
on the graph is defined by the number of edges connected to it.
Then, we perform a graph search on to extract the skeleton
of . To derive a skeleton based on a graph search, we seek a node
whose degree is one and whose position is the highest among
all the nodes on . is defined as the head of . Then, we
perform a depth-first search [21] from to construct a spanning
tree in which all leaf nodes correspond to different limbs of
. The branch nodes (whose degree is three on ) are the
key points used to decompose into different body parts, such
as the hands, feet, or torso. Let be the centroid of , and
let be the union of , , , and . The skeleton, ,
of can be extracted by directly linking any two nodes in
if they are connected through other nodes (or a path existing
between the two nodes [21]). Next, we summarize the algorithm
for skeleton extraction. The time complexity of this algorithm is , where is the number of triangle meshes.
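A minimal sketch of the skeleton extraction just described, under two stated assumptions: mesh connectivity is detected via a shared edge (two common vertices), and "highest position" means smallest image y coordinate (y grows downward in image space).

```python
from collections import defaultdict

def skeleton(meshes):
    """Sketch of triangulation-based skeleton extraction.

    meshes: list of triangles, each a tuple of three (x, y) vertices.
    Returns the head mesh index, the DFS spanning tree as a parent map,
    and the sorted branch and leaf node indices.
    """
    def centroid(tri):
        return (sum(p[0] for p in tri) / 3.0, sum(p[1] for p in tri) / 3.0)

    cents = [centroid(t) for t in meshes]
    # Two meshes are connected if they share one common edge (2 vertices).
    adj = defaultdict(set)
    for i in range(len(meshes)):
        for j in range(i + 1, len(meshes)):
            if len(set(meshes[i]) & set(meshes[j])) == 2:
                adj[i].add(j); adj[j].add(i)
    # Head: a degree-one node with the highest position (assumed to be
    # the smallest y in image coordinates).
    deg1 = [i for i in range(len(meshes)) if len(adj[i]) == 1]
    head = min(deg1, key=lambda i: cents[i][1])
    # Depth-first search from the head builds the spanning tree.
    parent, stack, seen = {head: None}, [head], {head}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v); parent[v] = u; stack.append(v)
    leaves = sorted(i for i in seen if len(adj[i]) == 1 and i != head)
    branches = sorted(i for i in seen if len(adj[i]) >= 3)
    return head, parent, branches, leaves
```

The adjacency test dominates the cost here; with a hash of edges it drops to time linear in the number of meshes, matching the stated complexity.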
Fig. 8. Polar transform of a human posture.
Fig. 9. Body component extraction. (a) Triangulation result of a posture; (b) the spanning tree of (a); and (c) centroids derived from different parts (determined by removing all branch nodes).
Then, similar to the technique used in shape context [27], we
project a sample onto a polar coordinate and label each mesh.
Fig. 8 shows a polar transform of a human posture. We use
to represent the number of shells used to quantize the radial axis and to represent the number of sectors that we want to
quantize in each shell. Therefore, the total number of bins used
to construct the centroid context is . For the centroid
of the triangular mesh of a posture, we construct a vector
histogram , in which
is the number of triangular mesh centroids in the th bin
when is considered as the origin, i.e.,
(6)
where is the th bin of the log-polar coordinate. Then,
given two histograms, and , the distance between
them can be measured by a normalized intersection [31]
(7)
where is the number of bins and denotes the number
of meshes calculated from a posture. Using (6) and (7), we can
define a centroid context to describe the characteristics of an
arbitrary posture .
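As an illustration of (6) and (7), the following sketch bins mesh centroids into a polar histogram around one origin and compares two histograms by normalized intersection. The log-radial quantization and the clipping of out-of-range points to the outermost shell are assumptions; the paper's exact binning is not fully recoverable from this copy.

```python
import math

def centroid_histogram(origin, centroids, n_shells=4, n_sectors=15, r_max=None):
    """Log-polar histogram of mesh centroids around one origin (Eq. (6) sketch).

    The bin layout (n_shells radial shells x n_sectors angular sectors)
    follows the description above; the log-radial quantization is assumed.
    """
    if r_max is None:
        r_max = max(math.hypot(x - origin[0], y - origin[1])
                    for x, y in centroids) or 1.0
    hist = [0] * (n_shells * n_sectors)
    for x, y in centroids:
        dx, dy = x - origin[0], y - origin[1]
        r, theta = math.hypot(dx, dy), math.atan2(dy, dx) % (2 * math.pi)
        # Log-radial shell index; points at r_max clip to the last shell.
        s = min(n_shells - 1,
                int(n_shells * math.log1p(r) / math.log1p(r_max)))
        a = min(n_sectors - 1, int(n_sectors * theta / (2 * math.pi)))
        hist[s * n_sectors + a] += 1
    return hist

def intersection_distance(h1, h2):
    """Normalized histogram intersection distance (Eq. (7) sketch)."""
    total = sum(h1) or 1
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2)) / total
```

Identical histograms are at distance 0; histograms with no overlapping bins are at distance 1, so the measure is directly usable inside (9).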
In the previous section, we presented a tree search algorithm
that finds a spanning tree from a posture based on the triangulation result. As shown in Fig. 9, (b) is the spanning tree de-
rived from (a). The tree captures the skeleton feature of .
Here, we call a node a branch node if it has more than one child. By this definition, there are three branch nodes in Fig. 9(b), i.e.,
Fig. 10. Multiple centroid contexts using different numbers of sectors and shells: (a) 4 shells and 15 sectors, (b) 8 shells and 30 sectors, and (c) centroid histogram of the gravity center of the posture.
, , and . If we remove all the branch nodes from ,
it will be decomposed into different branch paths . Then,
by carefully collecting the set of triangular meshes along each
path, we find that each path corresponds to one body component in . For example, in Fig. 9(b), if we remove from
, two branch paths will be formed, i.e., one from node
to and one from to node . The first path corresponds
to the head and neck of , and the second corresponds to the left hand of . In some cases, the path may not correspond to
a high-level semantic body component exactly, as shown by the
path from to . However, if the length of a path is further
constrained, over-segmentation can be easily avoided. Since the
segmentation of body parts is an ill-posed problem, we do not
propose a complete system for segmenting a human posture into
high-level semantic body parts. Instead, we use the current de-
composition result to recognize postures up to the middle level
(using components similar to the body parts, but not real ones).
Given a path , we collect a set of triangular meshes
along it. Let be the centroid of the triangular mesh closest to
the center of the set of meshes. As shown in Fig. 9(c), is the
centroid extracted from the path that begins at and ends at .
Given a centroid , we can obtain its corresponding histogram
using (6). Assume that the set of these path centroids is
. Then, based on , the centroid context of is defined as
follows:
(8)
where is the number of elements in . Fig. 10 shows two
cases of multiple centroid contexts when the number of shells
and sectors is set to (4, 15) and (8, 30), respectively. The cen-
troid contexts are extracted from the head and the center of the
posture, respectively. (c) is the centroid histogram of the gravity center of the posture. Given two postures, and , the distance
between their centroid contexts is measured by
(9)
where and are the area ratios of the th and th body
parts residing in and , respectively. Based on (9), an arbi-
trary pair of postures can be compared. We now summarize the algorithm for finding the centroid context of a posture .
Algorithm for Centroid Context Extraction:
Input: the spanning tree of a posture .
Output: the centroid context of .
Step 1: Recursively trace using a depth first search until
the tree is empty. If a branch node (a node with two children)
is found, collect all the visited nodes to a new path and
remove these nodes from .
Step 2: If a new path only contains two nodes, eliminate
it; otherwise, find its path centroid .
Step 3: Use (6) to find the centroid histogram of each
path centroid .
Step 4: Collect all the histograms as the centroid
context of .
The time complexity of centroid context extraction is
, where is the number of triangle meshes in
and .
B. Posture Recognition Using the Skeleton and the Centroid
Context
Given a posture, we can extract its skeleton feature and cen-
troid context using the techniques described in Sections III-A
and IV-A, respectively. Then, the distance between any two pos-
tures can be measured using (5) (for the skeleton) or (9) (for the
centroid context). As noted previously, the skeleton feature is
used for a coarse search and the centroid context feature is used
for a fine search. To obtain better recognition results, the two
distance measures should be integrated. We use the following weighted sum to represent the total distance:
(10)
where is the total distance between two postures
and and is a weight used to balance and
.
V. BEHAVIOR ANALYSIS USING STRING MATCHING
We represent each movement sequence of a human subject
by a continuous sequence of postures that changes over time. To
simplify the analysis, we convert a movement sequence into a set of posture symbols. Then, different movement sequences of
a human subject can be recognized and analyzed using a string
matching process that selects key postures as its basis. In the
following, we describe a method that automatically selects a
set of key postures from video sequences, and then present our
string matching scheme for behavior recognition.
A. Key Posture Selection
The movement sequences of a human subject are analyzed
directly from videos. Some postures in a sequence may be re-
dundant, so they should be removed from the modeling process.
To do this, we use a clustering technique to select a set of key postures from a collection of surveillance video clips.
Assume that all the postures have been extracted from a video
clip and each frame has only one posture. In addition, assume
is the posture extracted from the th frame. Given two adjacent
postures, and , the distance between them, , is calcu-
lated using (10), where is set at 0.5. Assume is the average
value of for all pairs of adjacent postures. For a posture , if
is greater than , we define it as a posture-change instance. By collecting all postures that correspond to posture-change in-
stances, we derive a set of key posture candidates . How-
ever, may still contain many redundant postures, which
would degrade the accuracy of the sequence modeling. To ad-
dress this problem, we use a clustering technique to find a better
set of key postures.
Initially, we assume each element in individually
forms a cluster . Then, given two cluster elements, and ,
in , the distance between them is defined by
(11)
where is defined in (10) and represents the number
of elements in . According to (11), we can execute an itera-
tive merging scheme to find a compact set of key postures from
. Let and be the th cluster and the collection of
all clusters , respectively, at the th iteration. At each itera-
tion, we choose a pair of clusters, and , whose distance
is the minimum among all pairs in , i.e.,
If is less than , then and are merged
to form a new cluster and construct a new collection of clusters
. The merging process is executed iteratively until no fur-
ther merging is possible. Assume is the final set of clusters.
Then, from the th cluster in , we can extract a key posture
that satisfies the following equation:
(12)
Based on (12) and checking all clusters in , the set of key
postures , i.e., , can be constructed for
analyzing human movement sequences.
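The key-posture selection procedure can be sketched as follows. Average linkage between clusters is an assumption standing in for (11), whose exact form is elided in this copy; the merge threshold plays the role of the stopping condition, and each final cluster is represented by its medoid, mirroring (12).

```python
def select_key_postures(candidates, dist, merge_thresh):
    """Iterative merging of key-posture candidates (Section V-A sketch).

    candidates: list of posture descriptors; dist(p, q): the distance of
    Eq. (10). Average linkage between clusters is an assumed stand-in for
    Eq. (11). Merging stops when the closest pair of clusters is farther
    apart than merge_thresh; each final cluster then yields the member
    with the smallest total distance to the rest (an Eq. (12) analogue).
    """
    clusters = [[p] for p in candidates]          # one cluster per candidate

    def cdist(a, b):                              # average-linkage distance
        return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(clusters[ij[0]], clusters[ij[1]]))
        if cdist(clusters[i], clusters[j]) >= merge_thresh:
            break                                 # no further merging possible
        clusters[i] += clusters.pop(j)
    # Key posture of each cluster: its medoid.
    return [min(c, key=lambda p: sum(dist(p, q) for q in c))
            for c in clusters]
```

With scalar stand-in postures and an absolute-difference distance, two well-separated groups collapse to their central members.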
B. Human Movement Sequence Recognition Using String
Matching
Based on the results of key posture selection and posture clas-
sification, we can model different human movement sequences
with strings. For example, in Fig. 11, there are three kinds of
movement sequence: walking, picking up, and falling. Let the
symbols and denote the start and end points of a se-
quence. Then, the sequence shown in Fig. 11 can be represented
by: (a) swwwe, (b) swwppwwe, and (c) swwwwffe, where
denotes walking, denotes picking-up, and denotes
falling. Based on this behavior parsing, different movement se-
quences can be compared using a string matching scheme.
Assume and are two movement sequences whose string representations are and , respectively. We use the edit
Fig. 11. Three kinds of movement sequence taken from: (a) Walking. (b) Picking up. (c) Falling.
distance [30] between and to measure the dissimilarity
between and . The edit distance between and is
defined as the minimum number of edit operations required to
change into . The operations include replacements, in-
sertions, and deletions. For any two strings a and b, we define D(i, j) as the edit distance between the first i characters of a and the first j characters of b. That is, D(i, j) is the minimum number of edit operations needed to transform the first i characters of a into the first j characters of b. In addition, let δ(a_i, b_j) be a function whose value is 0 if a_i = b_j and 1 if a_i ≠ b_j. Then, we get D(0, 0) = 0, D(i, 0) = i, and D(0, j) = j. D(i, j) is calculated by the following recursive equation:

D(i, j) = min{ D(i, j-1) + 1, D(i-1, j) + 1, D(i-1, j-1) + δ(a_i, b_j) }   (13)

where the insertion, deletion, and replacement operations correspond, respectively, to the transitions from cell (i, j-1) to cell (i, j), from cell (i-1, j) to cell (i, j), and from cell (i-1, j-1) to cell (i, j).
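The recursion of (13) is the classical edit-distance dynamic program, which can be sketched as follows.

```python
def edit_distance(a, b):
    """Classical edit distance of Eq. (13): the minimum number of
    insertions, deletions, and replacements turning string a into b."""
    m, n = len(a), len(b)
    # D[i][j] = distance between the first i chars of a and first j of b.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                         # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                         # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,          # insertion
                          D[i - 1][j] + 1,          # deletion
                          D[i - 1][j - 1] + delta)  # replacement or match
    return D[m][n]
```

Run on the strings of Fig. 11, this reproduces the problem the text describes: the walking clip swwwwwwe is at distance 3 from the walking model swwwe but only 2 from the picking-up and falling models, which is what motivates the weighted distance below (15).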
Assume that is a walking video clip whose string repre-
sentation is swwwwwwe. is different from that in Fig. 11(a)
due to the scaling problem along the time axis. According to
(13), the edit distance between and the sequence in Fig. 11(a)
is 3, while the edit distance between and the sequences
shown in Fig. 11(b) and (c) is 2. However, is more similar
to Fig. 11(a) than to Fig. 11(b) and (c). Clearly, (13) cannot
be directly applied to human movement sequence analysis
and must be modified. Thus, we define a new edit distance to
address this problem.
Let , , and be the respective costs of the insertion, replacement, and deletion operations performed at the
th and th characters of and , respectively. Then, (13)
can be rewritten as
(14)
Here, the replacement operation is considered more important
than insertion and deletion, since it means a change of pos-
ture type. Thus, the weight of insertion and deletion should
be scaled down to , where . Following this rule, if an insertion is found when calculating , its weight
will be if ; otherwise, its weight will be 1.
This implies that . Similarly, for the
deletion operation, we obtain . If
, the weights of the insertion and deletion
operations would be smaller than the weight of the replace-
ment operations; therefore, it would be impossible to choose
replacement as the next operation. This problem can be easily solved by setting the weight to so that
and will not be compared when calculating .
However, they will be compared when calculating
. Since the same end symbol e appears in and ,
the final distance will be equal to
. Thus, the delay in comparison will
not cause an error; however, the cost of insertion and dele-
tion will increase if any incorrect editing operations are chosen.
Therefore, (14) should be modified as follows:
(15)
where and
. In this work, is consistently set to 0.1 because we
consider the influence of replacement to be more significant than
that of insertion and deletion. In the experiments, we obtained
satisfactory results by using the above setting. The algorithm of
string matching to calculate the edit distance between and
is summarized as follows.
Algorithm for String Matching:
Input: two strings and .
Output: the edit distance between and .
Step 1: and .
Step 2: and for all
and
Step 3: For all and do
Step 4: Return .
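The modified distance of (15) can be sketched as follows. Because parts of (15) are lost in this copy, the weighting rule here is an assumption consistent with the description above: an insertion or deletion whose symbol merely repeats its predecessor (a posture held longer, i.e., time warping) costs only k = 0.1, while all other operations cost 1.

```python
def weighted_edit_distance(a, b, k=0.1):
    """Time-warping-tolerant edit distance in the spirit of Eq. (15).

    Assumption: insertions/deletions that repeat the previous symbol of
    the corresponding string cost k (0.1, as in the paper); all other
    insertions/deletions and all replacements cost 1.
    """
    m, n = len(a), len(b)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):           # boundary: delete a's prefix
        D[i][0] = D[i - 1][0] + (k if i > 1 and a[i - 1] == a[i - 2] else 1)
    for j in range(1, n + 1):           # boundary: insert b's prefix
        D[0][j] = D[0][j - 1] + (k if j > 1 and b[j - 1] == b[j - 2] else 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ins = k if j > 1 and b[j - 1] == b[j - 2] else 1   # repeat in b
            dele = k if i > 1 and a[i - 1] == a[i - 2] else 1  # repeat in a
            rep = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + ins,
                          D[i - 1][j] + dele,
                          D[i - 1][j - 1] + rep)
    return D[m][n]
```

Under this weighting the walking clip swwwwwwe is at distance 0.3 from the walking model swwwe (three cheap deletions of repeated w's), which is now smaller than its distance to the picking-up model, as intended.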
VI. EXPERIMENTAL RESULTS
To test the efficiency and effectiveness of the proposed ap-
proach, we constructed a large database of more than thirty
thousand postures, from which we obtained over three hun-
dred human movement sequences. Twenty percent of the subjects
in this database are female. In each sequence, we recorded a spe-
cific type of human behavior: walking, sitting, lying, squatting,
falling, picking up, jumping, climbing, stooping, or gymnas-
tics. Fig. 12 illustrates the skeleton-based posture recognition
system. The left-hand side of the figure shows a person walking,
while the right-hand side shows twelve pre-determined key pos-
tures for matching. Among the twelve key postures, which were
determined by a clustering algorithm, the one surrounded by a bold frame matches the current posture of the query action
Fig. 12. Results of walking behavior analysis using the skeleton feature. The recognized posture is highlighted with a bold frame and shown in (b).
Fig. 13. Result of walking behavior analysis using the centroid context feature. The recognized type is highlighted with a bold frame and shown in (b).
sequence [see (b)]. Fig. 13 shows the centroid context-based
posture recognition system. The current posture shown on the
left-hand side matches the third key posture (surrounded by a
bold frame) located on the right-hand side of the figure [see (b)].
In the first set of experiments, the performance of key posture
selection was examined. Since there is no benchmark behavior
database in the literature, twenty movement sequences (two se-
quences per type) were chosen to evaluate the accuracy of our
proposed key posture selection algorithm. For each sequence, a
set of key postures was manually selected to serve as the ground truth. Two postures were considered correctly matched if their
distance was smaller than 0.1 [see (10)]. Let
and be the numbers of key frames selected by a manual
method and our approach, respectively. We can then define the
accuracy of key posture selection as follows:
where is the number of key postures selected by our
method, each of which matches one of the manually selected
key postures. is the number of manually selected key
postures, each of which correctly matches one of the automat-ically selected key postures. Fig. 14 shows the results of key
Fig. 14. Result of key posture selection from three action sequences: walking, squatting, and gymnastics.
Fig. 15. Results of skeleton extraction using different sampling rates. The first row shows the skeletons extracted by connecting all triangle centroids, while the second row shows the skeletons extracted by connecting all branch nodes.
posture selection from sequences of walking, squatting, and
gymnastics. Due to space limitations, only 36 key postures
are shown in Fig. 14. With this set of key postures, different
movement sequences can be converted into a set of symbols
and then recognized through a string matching method. In the
experiments, the average accuracy of key posture selection was
91.9%.
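A minimal sketch of this accuracy measure is given below. The function names, the symbol spellings, and the toy one-dimensional distance function are ours; only the 0.1 matching threshold follows the text.

```python
# Sketch of the key-posture-selection accuracy measure. Two postures
# "match" when their distance is below the 0.1 threshold from the text.

def match_counts(auto_keys, manual_keys, dist, threshold=0.1):
    """Count automatic key postures matching a manual one, and vice versa."""
    n_a_to_m = sum(
        1 for a in auto_keys
        if any(dist(a, m) < threshold for m in manual_keys)
    )
    n_m_to_a = sum(
        1 for m in manual_keys
        if any(dist(m, a) < threshold for a in auto_keys)
    )
    return n_a_to_m, n_m_to_a

def selection_accuracy(auto_keys, manual_keys, dist, threshold=0.1):
    """(N_a_to_m + N_m_to_a) / (N_a + N_m), as in the accuracy definition."""
    n_a_to_m, n_m_to_a = match_counts(auto_keys, manual_keys, dist, threshold)
    return (n_a_to_m + n_m_to_a) / (len(auto_keys) + len(manual_keys))

# Toy example with 1-D "postures" and absolute distance:
d = lambda x, y: abs(x - y)
# selection_accuracy([0.0, 0.5, 1.0], [0.02, 0.5], d) -> 0.8
```

In a real evaluation, `dist` would be the posture distance of (10) rather than this scalar stand-in.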
In the second set of experiments, we evaluated the perfor-
mance of skeleton extraction using triangulation. Fig. 15 shows
the results of skeleton extraction using different sampling rates.
The top row illustrates the skeletons extracted by connecting all
the triangle centroids, while the bottom row shows the skeletons extracted by connecting all the branch nodes. It is obvious that
no matter which sampling rate was used, the skeletons were
always extracted; however, a higher sampling rate achieved a
better approximation result. Fig. 16 shows another example of
extracting a simple skeleton and a complex skeleton directly
from a profile posture. In this paper, a simple skeleton means
a skeleton derived by connecting all the branch nodes, and a
complex skeleton is a skeleton derived by connecting all the
centroids of all the triangles. To classify a posture, we have
to convert a skeleton into its corresponding distance map.
Fig. 17(a) and (b) show the results of applying the distance
transform to the skeletons shown in Fig. 16(a) and (b), respectively. Table I details the success rates of posture recognition when different numbers of meshes (resolutions) were used to
380 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 3, APRIL 2008
Fig. 16. Skeleton extraction directly from a posture profile. (a) Simple skeleton derived by connecting branch nodes and (b) complex skeleton derived by connecting all triangle centroids.
Fig. 17. Result of distance transform. (a) Distance map extracted from Fig. 16(a) and (b) distance map extracted from Fig. 16(b).
TABLE I
SUCCESS RATES OF POSTURE RECOGNITION WHEN DIFFERENT NUMBERS OF MESHES WERE USED FOR SKELETON EXTRACTION
extract both simple and complex skeletons. From the table,
we observe that 1) the performance obtained using complex
skeleton representation was better than that derived using the
simple skeleton and 2) the more meshes used for representation,
the better the performance of posture recognition.
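The skeleton-to-distance-map conversion can be sketched as follows. This is a brute-force Euclidean version for illustration only; a practical system would use a fast transform such as Borgefors's chamfer method [22], and the array layout here is our own choice.

```python
# Sketch: turn a binary skeleton image into a distance map (cf. Fig. 17)
# by measuring each pixel's Euclidean distance to the nearest skeleton pixel.
import numpy as np

def distance_map(skeleton):
    """skeleton: 2-D bool array, True on skeleton pixels."""
    ys, xs = np.nonzero(skeleton)
    pts = np.stack([ys, xs], axis=1).astype(float)        # (K, 2) skeleton pixels
    h, w = skeleton.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"),
                    axis=-1).reshape(-1, 2).astype(float)  # (h*w, 2) all pixels
    # distance from every pixel to its nearest skeleton pixel
    d = np.sqrt(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)).min(1)
    return d.reshape(h, w)

skel = np.zeros((5, 5), dtype=bool)
skel[2, :] = True            # a horizontal skeleton line on row 2
dm = distance_map(skel)
# pixels on the skeleton have distance 0; row 0 pixels are 2 away
```

The resulting map is what the classifier compares against the maps stored for the database postures.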
We explain how the posture recognition process works with the following example. The top row of Fig. 18 shows a number of input postures represented by simple skeletons. Their corresponding binary maps are shown in the bottom row of the figure using dark shading. The associated silhouettes, shown in the
second row, represent the corresponding posture types retrieved
from the posture database. Fig. 19 compares the performance
of the simple and complex skeletons in recognizing postures
from different action (movement) types. From Fig. 19,
it is obvious that the complex skeleton consistently outperforms
the simple skeleton. The reason is that the simple skeleton loses more information than the complex skeleton; therefore, the complex skeleton performs better in terms of accuracy.
The next set of experiments was conducted to evaluate the
performance of the centroid context feature. Since the results
of this experiment were influenced by the number of predefined sectors and shells, we conducted six sets of experiments. They
Fig. 18. Posture recognition results when the simple skeleton is used. The first row shows the skeletons of all query postures. The silhouettes in the second row show the corresponding posture type retrieved from the database.
Fig. 19. Success rates of posture recognition using the simple skeleton and complex skeleton on different movement types.
Fig. 20. Different numbers of sectors and shells used to define the centroid context: (a) 4 shells and 15 sectors and (b) 8 shells and 30 sectors.
were (4, 15), (4, 20), (4, 30), (8, 15), (8, 20), and (8, 30), in which
the first argument represents the number of shells and the second
represents the number of sectors. Fig. 20 shows two types of
polar partition. Table II lists the recognition rates when different
(shell, sector) parameter sets were predefined and applied to the
single centroid context and the multiple centroid context cases. In the former case, the center was set at the gravity center of a
HSIEH et al.: VIDEO-BASED HUMAN MOVEMENT ANALYSIS 381
TABLE II
RECOGNITION RATES OF POSTURES WHEN DIFFERENT NUMBERS OF SECTORS AND SHELLS WERE USED
Fig. 21. Posture recognition results using multiple centroid contexts.
Fig. 22. Comparison of posture recognition accuracy for different behavior types.
posture, but in the latter case, the center of each component was
taken as the gravity center of every partitioned component. From
the recognition rates reported in Table II, it is obvious that using
a multiple centroid context is definitely better than using a single
centroid context. Therefore, our behavior analysis system implements and adopts the multiple centroid context for posture classification. Among the different parameter sets, we found that the best setting for the (shell, sector) pair was (4, 30); therefore, we used this setting in all the experiments.
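The (shell, sector) partition can be sketched as a normalized polar histogram around a gravity center. The function name is ours, and whether the binned points are triangle centroids, contour pixels, or body-part centers is left open in this sketch.

```python
# Hedged sketch of a (shell, sector) polar histogram around a gravity
# center, in the spirit of the centroid context descriptor.
import math

def centroid_context(points, n_shells=4, n_sectors=30):
    """points: list of (x, y); returns an n_shells x n_sectors histogram
    of point fractions, binned by radius (shell) and angle (sector)."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    r_max = max(radii) or 1.0
    hist = [[0] * n_sectors for _ in range(n_shells)]
    for (x, y), r in zip(points, radii):
        shell = min(int(n_shells * r / r_max), n_shells - 1)
        theta = math.atan2(y - cy, x - cx) % (2 * math.pi)
        sector = min(int(n_sectors * theta / (2 * math.pi)), n_sectors - 1)
        hist[shell][sector] += 1
    total = len(points)
    return [[c / total for c in row] for row in hist]
```

With the best setting from Table II, one would call `centroid_context(points, n_shells=4, n_sectors=30)`; for the multiple centroid context, the same histogram would be computed once per partitioned body component.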
Fig. 21 shows the results when multiple centroid contexts
were adopted. The images in the top row show how multiple
centroid contexts cover four different postures. In these experi-
ments, we only used two centroid contexts to represent each pos-
ture. In the bottom row, we retrieved database postures (in sil-
houette form) and then superimposed their corresponding query
postures (in shaded form). All the retrieved postures were actually very close to the query postures. Fig. 22 lists the recognition rates for different types of behavior when the multiple centroid context approach was adopted. For all movement types,
Fig. 23. Posture recognition results using (from left to right) the CC, Shape Context, PPM, and Zernike moment methods, respectively. The shaded region shows the input query, and the silhouettes are their recognized posture types.
TABLE III
FUNCTIONAL COMPARISONS AMONG CENTROID CONTEXT, SHAPE CONTEXT, PPM, AND MOMENTS. K: number of meshes; v: number of body parts; n: number of boundary points; (w, h): dimension of the input posture; L: number of moment orders; T: translation; S: scaling; R: rotation. It is noticed that n ≫ K ≫ v.
the average retrieval accuracy rate was 88.9%. In another set
of experiments, we compared our multiple centroid context ap-
proach with three other approaches, namely the shape context-
based approach [27], the Zernike moment-based approach [29],
and the probabilistic projection map-based (PPM) approach [2].
Fig. 23 illustrates the recognition results for a walking pos-
ture (the shaded region). The silhouettes in the figure, from left
to right, represent the database postures retrieved by using the multiple centroid context approach [see (10)], the shape con-
text approach, the PPM approach, and the Zernike approach, re-
spectively. The shape context approach and the PPM approach
misclassified the posture type as "running." The Zernike approach correctly recognized the posture type as the "walking" mode. However, the recognized posture belonged to the next
key posture of the same movement sequence. Table III lists
the functional comparisons among these methods. The Zernike
approach has several invariance properties including rotation,
translation, and scaling. However, the calculation of moments
at different orders is very time consuming. The PPM approach
is simple and efficient but not robust to noise. The shape context is robust to noise but has a higher time complexity. Our centroid context is robust to noise and can represent a posture up
to a syntactic level. Due to its region-based nature, it can work
very efficiently to classify postures. Fig. 24 compares the per-
formance of the four approaches for ten behavior types. In terms
of accuracy, our approach's performance was similar to that of
the shape context approach. However, our approach is better at
dealing with sequences that involve significant movements, such
as climbing and walking. Most importantly, our method has a
lower time complexity than the shape context method because
it is a region-based descriptor, whereas the shape-context ap-
proach is pixel-based.
Next, we conducted a series of experiments to evaluate movement sequence recognition using the proposed string
Fig. 24. Comparison of the performance of different approaches.
TABLE IV
ANALYSIS OF TEN BEHAVIOR TYPES USING OUR PROPOSED STRING MATCHING SCHEME
matching scheme. We collected 450 movement sequences,
divided equally among the ten behavior types (i.e., each be-
havior type had 45 movement sequences). The experiment
results, listed in Table IV, show that most of the test sequences
were correctly identified and classified into the appropriate
categories. However, some of the squat sequences tended to be misclassified as picking up behavior because the two actions are highly similar. In this set of experiments, the
average accuracy of behavior analysis was 94.5%.
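As a sketch of this recognition step, each behavior can be stored as a template string of key-posture symbols and an unknown sequence assigned to the template with the smallest edit distance. The template strings below are invented for illustration; the paper's actual matcher follows the approximate string matching of [30].

```python
# Sketch of behavior classification by string matching: a movement is a
# string of key-posture symbols, matched against per-behavior templates.

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def classify(sequence, templates):
    """templates: dict mapping behavior name -> symbol string."""
    return min(templates, key=lambda t: edit_distance(sequence, templates[t]))

templates = {"walk": "ABAB", "squat": "CDDC", "pickup": "CDEC"}  # invented
print(classify("ABBAB", templates))  # prints "walk"
```

Because the distance is computed on short symbol strings rather than raw frames, sequences of different lengths and speeds can still be compared.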
To test the proposed approach on real-world applications, we
analyzed irregular (or suspicious) human movements captured
by a surveillance camera. First, we extracted a set of normal key postures from video sequences to learn regular human
movements like walking or running. After a training process, we
took unknown surveillance sequences and determined whether
the postures in them were regular or not. If some irregular pos-
tures appear continuously, an alarm message triggers a security
warning. For example, in Fig. 25(a), a set of normal key pos-
tures was extracted from a walking sequence. From these pos-
tures, we can tell that the posture in (b) is a normal posture.
However, the posture in (c) could be classified as abnormal,
since the person adopted a bending-down posture. We used a
marker to label these abnormal posture sequences until they re-
turned to normal. Fig. 26 shows another two cases of abnormal
posture detection. The postures in Fig. 26(a) and (c) were recognized as normal, since they were similar to the postures in
Fig. 25. Abnormal activity detection. (a) Five key postures defining several normal human movements. (b) Normal condition is detected. (c) Warning message is triggered when an abnormal posture is detected.
Fig. 26. Abnormal posture detection. (a) and (c): Normal postures were detected. (b) and (d): Abnormal postures were detected due to the unexpected shooting and climbing wall postures, respectively.
Fig. 25(a). However, the postures in Fig. 26(b) and (d) were clas-
sified as abnormal, since the subjects adopted "shooting" and "climbing wall" postures. The abnormal posture detection func-
tion improves a video surveillance system in two ways. First, it
saves a great deal of storage space because it is only necessary to
store abnormal postures. Second, it responds immediately when
an abnormal posture is identified.
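The alarm logic described above can be sketched as a run-length test over per-frame posture labels. The three-frame persistence threshold is our assumption, not a value from the paper.

```python
# Hedged sketch of the alarm logic: an alert fires only when irregular
# postures persist for several consecutive frames, and only those frame
# ranges need to be stored, which is how the system saves storage space.

def monitor(frame_labels, min_run=3):
    """frame_labels: list of 'normal'/'abnormal' per frame.
    Returns the list of (start, end) index ranges that raised an alarm."""
    alarms, run_start = [], None
    for i, label in enumerate(frame_labels):
        if label == "abnormal":
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                alarms.append((run_start, i - 1))
            run_start = None
    # close an abnormal run that reaches the end of the sequence
    if run_start is not None and len(frame_labels) - run_start >= min_run:
        alarms.append((run_start, len(frame_labels) - 1))
    return alarms
```

In the full system, the per-frame label would come from matching the current posture against the learned set of normal key postures, as in Fig. 25(a).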
VII. CONCLUSION
We have proposed a method for analyzing human movements
directly from videos. Specifically, the method comprises the
following.
1) A triangulation-based technique that extracts two im-
portant features, the skeleton feature and the centroid
context feature, from a posture to derive more semantic
meaning. The features form a finer descriptor that can
describe a posture from the shape of the whole body or
from body parts. Since the features complement each
other, all human postures can be compared and classi-
fied very accurately.
2) A clustering scheme for key posture selection, which
recognizes and codes human movements using a set of symbols.
3) A novel string-based technique for recognizing human movements. Even when events exhibit different scaling changes, they can still be recognized.
The average accuracies of posture classification and behavior
analysis were 95.16% and 94.5%, respectively. The experiment
results show that the proposed method is superior to other ap-
proaches for analyzing human movements in terms of accuracy, robustness, and stability.
REFERENCES
[1] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231–268, Mar. 2001.
[2] R. Cucchiara, C. Grana, A. Prati, and R. Vezzani, "Probabilistic posture classification for human-behavior analysis," IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 1, pp. 42–54, Jan. 2005.
[3] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 809–830, 2000.
[4] S. Maybank and T. Tan, "Introduction to special section on visual surveillance," Int. J. Comput. Vis., vol. 37, no. 2, pp. 173–173, 2000.
[5] I. Mikic et al., "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vis., vol. 53, no. 3, pp. 199–223, 2003.
[6] N. Werghi, "A discriminative 3D wavelet-based descriptors: Application to the recognition of human body postures," Pattern Recognit. Lett., vol. 26, no. 5, pp. 663–677, 2005.
[7] I.-C. Chang and C.-L. Huang, "The model-based human body motion analysis," Image Vis. Comput., vol. 18, no. 14, pp. 1067–1083, 2000.
[8] H. Fujiyoshi, A. J. Lipton, and T. Kanade, "Real-time human motion analysis by image skeletonization," IEICE Trans. Inform. Syst., vol. E87-D, no. 1, pp. 113–120, Jan. 2004.
[9] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 831–843, Aug. 2000.
[10] R. Rosales and S. Sclaroff, "3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1999, vol. 2, pp. 637–663.
[11] Y. Ivanov and A. Bobick, "Recognition of visual activities and interactions by stochastic parsing," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 852–872, Aug. 2000.
[12] S. Weik and C. E. Liedtke, "Hierarchical 3D pose estimation for articulated human body models from a sequence of volume data," in Proc. Int. Workshop on Robot Vision, Auckland, New Zealand, Feb. 2001, pp. 27–34.
[13] C. R. Wren et al., "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.
[14] J. Shewchuk, "Delaunay refinement algorithms for triangular mesh generation," Comput. Geometry: Theory Applicat., vol. 23, pp. 21–74, May 2002.
[15] S. Park and J. K. Aggarwal, "Segmentation and tracking of interacting human body parts under occlusion and shadowing," in Proc. IEEE Workshop on Motion and Video Computing, Orlando, FL, 2002, pp. 105–111.
[16] S. Park and J. K. Aggarwal, "Semantic-level understanding of human actions and interactions using event hierarchy," in Proc. 2004 Conf. Computer Vision and Pattern Recognition Workshop, Washington, DC, Jun. 2004.
[17] Y. Guo, G. Xu, and S. Tsuji, "Understanding human motion patterns," in Proc. IEEE Int. Conf. Pattern Recognition, 1994, vol. 2, pp. 325–329.
[18] Y. Guo, G. Xu, and S. Tsuji, "Tracking human body motion based on a stick figure model," J. Vis. Commun. Image Representat., vol. 5, 1994.
[19] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp. 1091–1104, Aug. 2002.
[20] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," in Proc. Int. Conf. Automatic Face and Gesture Recognition, Killington, VT, 1996, pp. 38–44.
[21] E. Horowitz, S. Sahni, and S. Anderson-Freed, Fundamentals of Data Structures in C. New York: W. H. Freeman.
[22] G. Borgefors, "Distance transformations in digital images," Comput. Vis., Graph., Image Process., vol. 34, no. 3, pp. 344–371, Feb. 1986.
[23] M. Bern and D. Eppstein, "Mesh generation and optimal triangulation," in Computing in Euclidean Geometry, 2nd ed. World Scientific, 1995, pp. 47–123.
[24] C. Stauffer and E. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 747–757, 2000.
[25] D. Chetverikov and Z. Szabo, "A simple and efficient algorithm for detection of high curvature points in planar curves," in Proc. 23rd Workshop Austrian Pattern Recognition Group, 1999, pp. 175–184.
[26] L. P. Chew, "Constrained Delaunay triangulations," Algorithmica, vol. 4, no. 1, pp. 97–108, 1989.
[27] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[28] Y. Rui, A. C. She, and T. S. Huang, "Modified Fourier descriptors for shape representation - a practical approach," in Proc. 1st Int. Workshop on Image Databases and Multi-Media Search, Amsterdam, The Netherlands, Aug. 1996.
[29] A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 5, pp. 489–497, May 1990.
[30] E. Ukkonen, "On approximate string matching," in Proc. Int. Conf. Foundations of Computation Theory, 1983, pp. 487–495.
[31] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
Jun-Wei Hsieh (M'06) received the B.S. degree in computer science from Tunghai University, Taiwan, R.O.C., in 1990 and the Ph.D. degree in computer engineering from the National Central University, Taiwan, in 1995.
From 1996 to 2000, he was a Research Fellow at the Industrial Technology Research Institute, Hsinchu, Taiwan, where he managed a team developing video-related technologies. He is presently an Associate Professor with the Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan. His research interests include content-based multimedia databases, video indexing and retrieval, computer vision, and pattern recognition.
Dr. Hsieh received the Best Paper Awards from the Image Processing and Pattern Recognition Society of Taiwan in 2005, 2006, and 2007, and the Distinguished Research Award from Yuan Ze University in 2006. He is a recipient of the Phi Tau Phi Award.
Yung-Tai Hsu was born in Kaohsiung, Taiwan, R.O.C., in 1978. He received the B.S. and M.S. degrees in mechanical engineering from Yuan Ze University, Taiwan, in 2000 and 2002, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at Yuan Ze University, Chung-Li, Taiwan.
His research interests include pattern recognition, computer vision, and machine learning.
Mr. Hsu received the Best Paper Award at the 2006 IPPR Conference on Computer Vision, Graphics, and Image Processing.
Hong-Yuan Mark Liao (SM'01) received the B.S. degree in physics from National Tsing-Hua University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 1985 and 1990, respectively.
He was a Research Associate with the Computer Vision and Image Processing Laboratory, Northwestern University, in 1990–1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an Assistant
Research Fellow. He was promoted to Associate Research Fellow and then Research Fellow in 1995 and 1998, respectively. From August 1997 to July
2000, he served as the Deputy Director of the institute. From February 2001to January 2004, he was the Acting Director of the Institute of AppliedScience and Engineering Research. He is jointly appointed as a Professor ofthe Computer Science and Information Engineering Department of NationalChiao-Tung University. His current research interests include multimediasignal processing, video surveillance, and content-based multimedia retrieval.
Dr. Liao is the Editor-in-Chief of the Journal of Information Science and Engineering. He is on the Editorial Boards of the International Journal of Visual Communication and Image Representation, the EURASIP Journal on Advances in Signal Processing, and Research Letters in Signal Processing. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 1998 to 2001. He received the Young Investigators' Award from Academia Sinica in 1998, the Excellent Paper Award from the Image Processing and Pattern Recognition Society of Taiwan in 1998 and 2000, the Distinguished Research Award from the National Science Council of Taiwan in 2003, and the National Invention Award in 2004. He served as the Program Chair of the International Symposium on Multimedia Information Processing (ISMIP '97), the Program Co-Chair of the Second IEEE Pacific-Rim Conference on Multimedia (2001), the Conference Co-Chair of the Fifth IEEE International Conference on Multimedia and Expo (ICME 2004), and as Technical Co-Chair of the Eighth IEEE ICME (2007).
Chih-Chiang Chen was born in Taipei, Taiwan, R.O.C., in 1982. He received the B.S. degree in electrical engineering from Yuan Ze University, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.
His research interests include image processing, behavior analysis, and video surveillance.