
2114 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 9, SEPTEMBER 2009

Multicamera Tracking of Articulated Human Motion Using Shape and Motion Cues

Aravind Sundaresan and Rama Chellappa, Fellow, IEEE

Abstract—We present a completely automatic algorithm for initializing and tracking the articulated motion of humans using image sequences obtained from multiple cameras. A detailed articulated human body model composed of sixteen rigid segments that allows both translation and rotation at joints is used. Voxel data of the subject obtained from the images is segmented into the different articulated chains using Laplacian Eigenmaps. The segmented chains are registered in a subset of the frames using a single-frame registration technique and subsequently used to initialize the pose in the sequence. A temporal registration method is proposed to identify the partially segmented or unregistered articulated chains in the remaining frames in the sequence. The proposed tracker uses motion cues such as pixel displacement as well as 2-D and 3-D shape cues such as silhouettes, motion residue, and skeleton curves. The tracking algorithm consists of a predictor that uses motion cues and a corrector that uses shape cues. The use of complementary cues in the tracking alleviates the twin problems of drift and convergence to local minima. The use of multiple cameras also allows us to deal with the problems due to self-occlusion and kinematic singularity. We present tracking results on sequences with different kinds of motion to illustrate the effectiveness of our approach. The pose of the subject is correctly tracked for the duration of the sequence as can be verified by inspection.

I. INTRODUCTION

HUMAN pose estimation and tracking from video sequences, or motion capture, has important applications in a variety of fields such as biomechanical and clinical analysis, human computer interaction, and animation. Current approaches are marker-based, involving the placement of markers on the body of the subject and capturing the movement of the subject using a set of specialized cameras. The use of markerless techniques eliminates the need for the specialized equipment as well as the expertise and time required to place the markers. It can also potentially measure the pose using anatomically appropriate models rather than estimating it from a set of markers. Different applications have different needs and use single or multiple cameras to estimate human pose, and this problem has received much attention in the image processing and computer vision literature in both the monocular [1]–[4] and multiple camera cases [5]–[9].

Manuscript received April 27, 2008; revised March 26, 2009. First published May 05, 2009; current version published August 14, 2009. This work was supported in part by NSF ITR 0325715. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Scott T. Acton.

A. Sundaresan is with SRI International, Menlo Park, CA 94025 USA (e-mail: [email protected]).

R. Chellappa is with the Center for Automation Research, University of Maryland, College Park, MD 20742 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2009.2022290

A survey of a number of important pose estimation methods developed in the past decade may be found in [10]–[12].

The typical steps in motion capture [13] are (1) model estimation, (2) pose initialization, and (3) tracking. Model estimation is the process of estimating the parameters of the human body model, such as the shape of the body segments and their articulated structure. Pose initialization refers to the estimation of the pose given a single frame.¹ Pose tracking refers to the estimation of the pose in the next frame, given the pose in the current frame. Both (2) and (3) perform pose estimation, but the methods employed are usually different and we list them separately.

The articulated structure of the human body, which is composed of a number of segments, each with its associated shape and pose, makes human pose estimation a challenging task. The complexity of the human body and the range of poses it can assume necessitate the use of a detailed model in order to represent its pose and to guide pose estimation. Body models typically incorporate both the shape of individual body parts and structural aspects such as the articulated connectivity and joint locations of the human body. Besides the sheer complexity of the human body, a common problem faced in image-based methods is that some parts of the body often occlude other parts (self-occlusion). It is, therefore, difficult to perceive and estimate motion in the direction perpendicular to the image plane when using images from a single camera. Morris and Rehg [14] refer to this problem as "kinematic singularity" and study it in some detail. Monocular techniques suffer from the above problems of self-occlusion and "kinematic singularities," and multiple cameras are required to estimate pose in a robust and accurate manner.

A. Related Work

Gavrila and Davis [10], Aggarwal and Cai [15], and Moeslund and Granum [11] provide surveys of human motion tracking and analysis methods. Sigal and Black [12] provide a recent survey on human pose estimation. We list some of the important monocular and multicamera techniques in Sections I-A1 and I-A2, followed by a brief discussion of their limitations.

1) Monocular Methods: As mentioned earlier, monocular techniques suffer from a range of problems that mark them as unsuitable for markerless motion capture. Monocular methods can be classified according to the image cues used, ranging from edges [2] and silhouettes [16] to 2-D image motion [3], [4]. In [17], the pose vector is estimated using support vector machines and kinematic constraints. The issue of model acquisition and initialization using images from a single camera is addressed in [18].

¹By frame, we refer to image(s) obtained at a single time instant; it is one image in the monocular case and a set of images in the multicamera case.


Fig. 1. Schematic of the motion capture algorithm with the three steps in the dashed boxes on the right and the different cues used on the left. The contribution in this paper and [8] is clearly marked.

The problems of self-occlusion and kinematic ambiguities in monocular video have been addressed with limited success in [19]–[21]. Many of the monocular techniques predate multicamera methods and some have been extended to multiple cameras.

2) Multicamera Methods: Multicamera methods can be broadly classified as shape-based and motion-based. Shape-based methods use 2-D shape cues such as silhouettes or edges [22]–[24] or 3-D shape cues such as voxels [5]–[8]. The voxel representation of a person provides cues about the 3-D shape of the person and is often used in pose estimation algorithms. Motion-based methods [25]–[27] typically use optical flow in the images to perform tracking. The motion-based methods estimate the change in pose and typically assume that the initial pose is available. On the other hand, shape-based methods use absolute cues and can be used both to initialize the pose given a single frame [5]–[8] and to perform tracking [23], [24], [28], [29]. We first list automatic or semi-automatic methods to estimate and track the pose, followed by some important multicamera techniques.

Mikic et al. [6] and Mündermann et al. [7] perform all the steps in motion capture using voxel-based techniques. They are, however, limited by the shortcomings of shape-based methods, and in the case of [7], the model is not obtained automatically. Chu et al. [5] use volume data to acquire and track a human body model and Cheung et al. [9] use shape from silhouette to estimate human body kinematics. However, in [5], no tracking is performed, while in [9], the subject is required to articulate one joint at a time in order to initialize the pose.

The following techniques assume that an initial pose estimate is available and perform tracking using shape and motion cues. Yamamoto et al. [26] track human motion using multiple cameras and optical flow. Bregler and Malik [27] also use optical flow and an orthographic camera model. Gavrila and Davis [30] discuss a multiview approach for 3-D model-based tracking of humans in action. They use a generate-and-test algorithm in which they search for poses in a parameter space and match them using a variant of Chamfer matching. Kakadiaris and Metaxas [22] use silhouettes from multiple cameras to estimate 3-D motion. Delamarre and Faugeras [23] use 3-D articulated models for tracking with silhouettes. They apply forces to the contours obtained from the projection of the 3-D model so that they move towards the silhouette contours obtained from the multiple images.

Moeslund and Granum [24] perform model-based human motion capture using cues such as depth (obtained from a stereo rig) and the extracted silhouette, while kinematic constraints are applied to exclude impossible poses from the parameter space. Sigal et al. [28], [29] use nonparametric belief propagation to track in a multicamera setup.

Motion-based trackers suffer from the problem of drift; i.e., they estimate the change in pose from frame to frame and, as a result, the error accumulates over time. On the other hand, shape-based methods rely on absolute cues and do not face the drift problem, but it is not possible to extract reliable shape cues in every frame. They typically attempt to minimize an objective function (which measures the error in the pose) and are prone to converging to incorrect local minima. Specifically, background subtraction or voxel reconstruction errors in voxel-based methods result in cases where body segments are missing or adjacent body segments are merged into one. We note that shape cues and motion cues are complementary in nature and it is beneficial to combine these cues to track the pose. We briefly describe our algorithm and discuss how it addresses the above limitations in the following section.

B. Algorithm Summary

We present a detailed articulated model and algorithms for estimating the human body model and initializing and tracking the pose in a completely automatic manner. The work presented in this paper is the second part of a larger body of work, namely, the complete automatic motion capture system. In order to place this work within the context of the larger body of work, we first describe the contribution of our earlier paper titled "Model driven segmentation and registration of articulating humans in Laplacian eigenspace" [8] and then explain the contribution of this paper. The overview of our complete motion capture system is illustrated in Fig. 1.

In [8], we present an algorithm for segmenting volumetric representations (voxels) of the human body by mapping them to Laplacian Eigenspace. We also describe an application of this algorithm to human body model and pose estimation and provide experimental validation using both synthetic and real voxel data. Some of the key results of the above algorithm are illustrated in Fig. 2. Given a sequence of 3-D voxel data of human motion [Fig. 2(a)], the human body model and pose [Fig. 2(b)–(c)] are estimated using a subset of the frames in the sequence.


Fig. 2. Illustration of skeleton and super-quadric model. (a) Voxel data. (b) Skeleton model. (c) Corresponding super-quadric model.

The human body model consists of rigid segments connected in an articulated tree structure. There are six articulated chains (the trunk, head, and four limbs). The pose of each rigid segment is represented in general using a 6-vector (3 degrees of freedom for translation and 3 for rotation). However, in our work we constrain most of the joints to possess only rotational motion (3 degrees of freedom). The full body pose, $\Xi$, is represented in a parametric form as a stacked vector of the poses of each segment. The concept of segmentation in Laplacian Eigenspace was introduced in [31] and its application to human body model estimation in [32]. We describe the theoretical underpinnings of segmentation in Laplacian Eigenspace in [8] and compare it to other spectral methods such as Isomap [33] with respect to the particular application of human body segmentation.

In this paper, we present an algorithm for tracking the articulated motion of humans using shape and motion cues. We integrate the tracking step with model estimation and pose initialization to build a completely automatic motion capture system, the block diagram of which is illustrated in Fig. 1. We note that the pose initialization algorithm typically works in only a fraction of the frames in a sequence. The failures are typically due both to errors in the processing of the shape cues (e.g., voxel reconstruction) and, in general, to the complexity of the pose itself. A tracking module is, therefore, essential to complete the motion capture system. The contributions in this paper are enumerated below.

1) We introduce a temporal registration algorithm for registering previously unidentified skeleton curves.

2) We describe a framework for combining 2-D cues such as pixel displacement, silhouettes, and motion residues as well as 3-D shape cues to track the pose in the sequence.

3) We describe a technique for smoothing the pose to improve the tracking estimates.

4) We integrate the model estimation and pose initialization modules [8], [32] with the pose tracking modules to realize a completely automatic motion capture system and present results on real data sequences.

It can be expected that using a single type of image feature leads to a single point of failure in the algorithm, and, hence, it is desirable to use different kinds of shape and motion features or cues. Our algorithm uses motion cues in the form of pixel displacements as well as 2-D and 3-D shape cues such as skeleton curves, silhouettes, and "motion residues." These are described in detail in Sections II and III. Thus, our algorithm does not have a single point of failure and is robust. Trackers which use only motion cues suffer from the drift problem due to an accumulation of the tracking error. On the other hand, trackers that use shape cues, which are absolute, often involve an energy minimization formulation and can converge to the wrong local minima. The motion and shape cues, when combined, work to alleviate the drift and local-minima problems that are manifest when they are applied separately.

Since we use motion and shape cues in our tracking algorithm, we are able to better deal with cases where the body segments are close to each other, such as when the arms are by the side of the body in a typical walking posture. Purely silhouette-based methods, including those that use voxels, experience difficulties in such cases. Indeed, we use a voxel-based algorithm to initialize the pose and initiate the tracking, but the registration algorithm used in the initialization fails in a number of cases where the body segments are too close to each other or when errors in the 2-D silhouette estimation cause holes and gaps in the voxel reconstruction. Silhouette or edge-based methods also have problems estimating rotation about the axis of a body segment, as it is impossible to detect the motion of a sphere or cylinder rotating about its axis by observing only its silhouette. We also propose an optional smoothing step that smooths the trunk pose, improving the performance of our tracker.

In our experiments, we use eight cameras that are placed around the subject. While the tracking algorithm works with fewer than eight cameras, we need at least eight cameras to obtain a reasonable voxel reconstruction for the purpose of pose initialization. A visual inspection of the voxel reconstruction obtained using fewer than eight cameras found "ghost" limbs in a number of frames; the reconstruction was of poorer quality and unsuitable for pose estimation, as was also noted by Mündermann et al. [34]. We note that the prediction module of our tracker requires that the motion between frames be small enough so that pixel displacement can be estimated and the iterative pose estimation algorithm converges. We observe in our experiments that a frame rate of 30 fps suffices for normal human walking motion. The results of the tracking algorithm on sequences with different motions, such as swinging the arms in wide arcs, walking in a straight line, and walking in circles, are presented. The algorithm proposed in this paper can be used in a number of biomechanical applications, such as gait analysis, as well as in general human motion analysis.

The organization of the paper is as follows. We describe our human body model, the corresponding pose vector, and its estimation in Section II. The details of the algorithm are presented in Section III. We describe pose initialization and temporal spline registration, and then describe the two-step tracking process. We also describe the smoothing step. Finally, we present results of our algorithm on three different sequences using real images captured from eight cameras in Section IV.

II. HUMAN BODY MODEL, POSE, AND TRACKING

We briefly describe our human body model in Section II-A and the reconstruction of the subject using the body model parameters and the pose vector in Section II-B. We also describe some of the key modules of our tracking algorithm. The linear relation between the pixel velocity and the pose velocity is derived, and the estimation of the change in pose from pixel displacements is described in Section II-C.


Fig. 3. Articulated structure and the relative positions of body segments as a function of the body model and pose. The red, green, and blue axis set describes the pose of the coordinate frame for each segment. (a) Wire-frame model. (b) Underlying articulated structure.

The manner in which we use 3-D skeleton curves for tracking is described in Section II-D.

A. Human Body Model

We model the human body as consisting of six articulated chains, namely the trunk (lower trunk, upper trunk), head (neck, head), two arms (upper arm, forearm, palm), and two legs (thigh, leg, foot), as illustrated in Fig. 2(c). Our model is based on the underlying skeletal structure and flexibility of the human body. Each rigid segment is represented by a tapered super-quadric. The model consists of the joint locations and the parameters of the tapered super-quadrics describing each rigid segment. The model can be simplified to a skeleton model using just the axes of the super-quadrics, as illustrated in Fig. 2(b). The recovery of the human body model is described in detail in [8], [32].

B. Description of Pose Vector

Let $T(\xi)$ be a transformation matrix in homogeneous 3-D coordinates consisting of a rotational component, $R(\omega)$, and a translational component, $t$. The pose vector for a single body segment consists of both components and is given by $\xi = (t^T, \omega^T)^T$. $T(\xi)$ is expanded as

$$T(\xi) = \begin{bmatrix} R(\omega) & t \\ 0^T & 1 \end{bmatrix}, \quad\text{where}\quad R(\omega) = e^{\hat{\omega}}. \qquad (1)$$

The $\wedge$ (hat) operator is described in [35] and maps a 6 × 1 pose vector to the corresponding 4 × 4 coordinate transform matrix. The $\vee$ (vee) operator is the inverse of the $\wedge$ operator and maps a 4 × 4 coordinate transform to a 6 × 1 pose vector, i.e., $(\hat{\xi})^{\vee} = \xi$. We hereafter drop the dependency on the pose vector for brevity. The articulated nature of the body is illustrated in Fig. 3. The lower trunk is the root of the kinematic chain and all body segments are attached to the root in a kinematic chain. Each body segment has six degrees of freedom (DoF) in general and its pose relative to its parent is described using the above pose vector.

We first define a world coordinate frame that is fixed for the entire experiment. The full body pose is computed with respect to this coordinate frame. The pose of each body segment is described by a combination of body model and pose parameters. We use the superscripts $m$ and $p$ to denote model structure and pose parameters, respectively. For instance, $x^m_i$ is a joint location and is part of the body model, while $t^p_i$ is the translational pose at the joint and is part of the pose vector. Consider two segments, $i-1$ and $i$, in Fig. 3, where segment $i-1$ is the parent of segment $i$. Segment $i$ is connected to its parent at joint $i$, whose location is given by $x^m_i$ in the coordinate frame of the parent. We hereafter use the word "frame" as an abbreviation for "coordinate frame." The pose of segment $i$ with respect to its parent (segment $i-1$) is $\xi_i$. The complete transformation between segment $i$ and segment $i-1$ is, therefore, given by

$$G_{(i-1)i} = \begin{bmatrix} I & x^m_i \\ 0^T & 1 \end{bmatrix} T(\xi_i). \qquad (2)$$

$G_{(i-1)i}$ represents a transformation matrix of a point from the coordinate frame of segment $i$ to the coordinate frame of segment $i-1$. $G_{01}$ represents the transformation matrix of the root of the kinematic chain (index 1) with respect to the world coordinate frame (index 0). The pose of the $i$th segment in Fig. 3 with respect to the world coordinate frame is, therefore, given by

$$G_{0i} = G_{01} G_{12} \cdots G_{(i-1)i}. \qquad (3)$$

For a strictly articulated body, the translation component of the pose at all joints is zero, i.e., $t_i = 0$ for all $i$. However, we allow limited translation at certain joint locations such as the shoulder (a "compound" joint that cannot be represented by a single rotational joint) to better model its complexity. We set $t_i \neq 0$ for $i \in S$, where $S$ denotes the set of special joints. Our human body model consists of sixteen rigid segments. The pose of segment $i$ is given by $\xi_i = (t_i^T, \omega_i^T)^T$ and the full body pose is given by $\Xi = (\xi_1^T, \xi_2^T, \ldots, \xi_{16}^T)^T$.
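As an illustration of this representation, the following Python sketch implements hat/vee-style helpers and the chained transforms of (2)–(3). The helper names are ours, and parameterizing the rotation with scipy's rotation-vector (axis-angle exponential map) is an assumption consistent with [35], not the paper's code.

```python
# Sketch of the pose representation (1) and forward kinematics (2)-(3).
import numpy as np
from scipy.spatial.transform import Rotation


def hat(xi):
    """Map a 6-vector pose xi = (t, omega) to a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(xi[3:]).as_matrix()  # R(omega) = exp(omega^)
    T[:3, 3] = xi[:3]                                     # translation t
    return T


def vee(T):
    """Inverse of hat: map a 4x4 transform back to a 6-vector pose."""
    return np.concatenate([T[:3, 3],
                           Rotation.from_matrix(T[:3, :3]).as_rotvec()])


def chain_pose(joints, poses):
    """Compose (2)-(3) along one kinematic chain.

    joints: joint locations x_i^m in each parent frame (model parameters).
    poses:  6-vector poses xi_i, one per segment (root pose first).
    Returns the 4x4 transform of the last segment w.r.t. the world frame.
    """
    G = hat(poses[0])                      # G_01: root w.r.t. world
    for x_joint, xi in zip(joints, poses[1:]):
        offset = np.eye(4)
        offset[:3, 3] = x_joint            # translate to the joint location
        G = G @ offset @ hat(xi)           # G_0i = G_0(i-1) A_i T(xi_i)
    return G
```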

C. Tracking Pose Using Pixel Displacement

In this section, we describe how we estimate the change in pose of an articulated body using pixel displacement. To begin with, we introduce the 6 × 1 pose velocity vector $\phi$. We can describe any six-DoF transformation matrix at time $t$ as $T(t) = T(0)\,e^{\hat{\phi}t}$ [35]. The pose velocity is so called as it can be considered as the instantaneous velocity of the pose vector. The instantaneous pose velocity at $t = 0$ is given by

$$\hat{\phi} = T^{-1}(0)\,\dot{T}(0). \qquad (4)$$

We derive the relation between the pose velocity vector and the 3-D pixel velocity as well as the 2-D pixel velocity. Finally, we describe the algorithm to estimate the pose velocity from the pixel displacement. We denote the transformation of frame $i$ (attached to segment $i$) with respect to frame $j$ (attached to segment $j$) at time $t$ by $G_{ji}(t)$. We can then express $G_{ji}(t)$ as

$$G_{ji}(t) = G_{ji}\,e^{\hat{\phi}_i t} \qquad (5)$$


where $G_{ji}$ is defined earlier. We use $G_{ji}(t)$ to denote time-varying matrices and $G_{ji}$ to denote constant matrices. We note that $e^{\hat{\phi}t}$ can be expanded using the Taylor series as

$$e^{\hat{\phi}t} = I + \hat{\phi}t + O(t^2). \qquad (6)$$

Let us consider a point attached to segment $i$. Its coordinates in frame $i$ and frame $j$ are given by $x_i$ and $x_j$, respectively. We then have

$$x_j = G_{ji}(t)\,x_i. \qquad (7)$$

We consider motion at $t = 0$ without loss of generality. Considering the instantaneous velocity of the point in frame $j$, we have

$$\dot{x}_j = \dot{G}_{ji}(t)\,x_i + G_{ji}(t)\,\dot{x}_i = \dot{G}_{ji}(t)\,x_i \qquad (8)$$

where the second equation follows because the point is attached to frame $i$ and, therefore, $\dot{x}_i = 0$. Substituting (5) in (8), we get (9). (10) follows, where $B(\cdot)$ is defined in (11)

$$\dot{x}_j = G_{ji}\,\hat{\phi}_i\,x_i \qquad (9)$$

$$\dot{x}_j = G_{ji}\,B(x_i)\,\phi_i \qquad (10)$$

$$\hat{\phi}\,x = B(x)\,\phi \quad \text{for all } x, \phi. \qquad (11)$$

Assuming there are a total of $n$ segments, and given a point $x_n$ on the $n$th segment, we have

$$x_0 = G_{01}(t)\,G_{12}(t)\cdots G_{(n-1)n}(t)\,x_n. \qquad (12)$$

Differentiating this product at $t = 0$ term by term, and applying (5), (6), and (9)–(11) to each factor [the intermediate expansions (13)–(17)], yields the compact linear relation

$$\dot{x}_0 = \mathcal{B}(x)\,\Phi = \begin{bmatrix} B_1(x) & B_2(x) & \cdots & B_n(x) \end{bmatrix} \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix} \qquad (18)$$

where $B_i(x) = G_{01}\cdots G_{(i-1)i}\,B(x_i)$, $\Phi = (\phi_1^T, \ldots, \phi_n^T)^T$ is the stacked pose velocity, and $\mathcal{B}(x)$ follows from (17).

Let the pose at time $t$ and $t+1$ be $\Xi(t) = (\xi_1^T(t), \ldots, \xi_n^T(t))^T$ and $\Xi(t+1) = (\xi_1^T(t+1), \ldots, \xi_n^T(t+1))^T$, respectively. The pose at $t+1$ for each segment in the body is then given by

$$\xi_i(t+1) = \left( T(\xi_i(t))\,e^{\hat{\phi}_i} \right)^{\vee}. \qquad (19)$$

We can represent the set of operations (19) using the abbreviated notation

$$\Xi(t+1) = \Xi(t) \oplus \Phi \qquad (20)$$

where the upper-case Greek letters ($\Xi$, $\Phi$) refer to the vector stacks of the per-segment quantities represented by lower-case Greek letters ($\xi_i$, $\phi_i$).

We have shown in [36] that if we use a perspective projection to project a 3-D point onto the camera, the resulting pixel velocity is still a linear function of the pose velocity. Let $P \in \mathbb{R}^{3 \times 4}$ be the projection matrix; then the pixel location in homogeneous coordinates is given by $\tilde{u} = P x_0$ and the actual pixel coordinates are given by $u$ as

$$u = \begin{bmatrix} \tilde{u}_1/\tilde{u}_3 \\ \tilde{u}_2/\tilde{u}_3 \end{bmatrix}. \qquad (21)$$

We obtain the pixel velocity by differentiating (21) as

$$\dot{u} = \frac{1}{\tilde{u}_3} \begin{bmatrix} 1 & 0 & -\tilde{u}_1/\tilde{u}_3 \\ 0 & 1 & -\tilde{u}_2/\tilde{u}_3 \end{bmatrix} P\,\dot{x}_0 \qquad (22)$$

$$\dot{u} = M\,\dot{x}_0 \qquad (23)$$

$$\dot{u} = M\,\mathcal{B}(x)\,\Phi = L\,\Phi. \qquad (24)$$

We represent the matrix in (22) as $M$ in (23). We thus combine (18) and (23) to express the 2-D pixel velocity as a linear function of the 3-D pose velocity, $\Phi$, in (24). We can estimate $\Phi$ from the pixel velocity by inverting (24); then we can use (20) to compute the new pose from $\Phi$. However, we can only measure pixel displacement from the images, and, hence, we use a first-order approximation of the pixel velocity.
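A minimal sketch of the projection (21) and the Jacobian $M$ of (22)–(23), assuming a 3 × 4 projection matrix and a homogeneous 3-D point; the function names are illustrative.

```python
# Perspective projection (21) and its 2x4 Jacobian M P as in (22)-(23).
import numpy as np


def project(P, X):
    """Project a homogeneous 3-D point X (4-vector) to pixel coordinates."""
    u_tilde = P @ X
    return u_tilde[:2] / u_tilde[2]


def projection_jacobian(P, X):
    """2x4 matrix mapping the 3-D point velocity x0_dot to the pixel velocity."""
    u_tilde = P @ X
    z = u_tilde[2]
    M = np.array([[1.0 / z, 0.0, -u_tilde[0] / z**2],
                  [0.0, 1.0 / z, -u_tilde[1] / z**2]])
    return M @ P
```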

Given a set of points, we compute the projection of each of these points for all the cameras using the pose $\Xi$. We call this stacked vector $U(\Xi)$. We also compute the stacked pixel displacement matrix $L(\Xi)$. $U(\Xi)$ and $L(\Xi)$ are functions of both the 3-D point coordinates and the projection matrices besides $\Xi$, but as these are fixed for a given frame, we do not explicitly denote them for the sake of simplicity. For a set of $N$ points, we, therefore, have

$$\begin{bmatrix} \dot{u}_1 \\ \vdots \\ \dot{u}_N \end{bmatrix} = \begin{bmatrix} L_1 \\ \vdots \\ L_N \end{bmatrix} \Phi = L(\Xi)\,\Phi. \qquad (25)$$

The state vector in our state-space formulation is $\Xi$, and the state update and observation equations are given by (26) and (27), where $w$ is the measurement noise


$$\Xi(t+1) = \Xi(t) \oplus \Phi(t) \qquad (26)$$

$$U(t+1) = U(\Xi(t)) + L(\Xi(t))\,\Phi(t) + w. \qquad (27)$$

We note that our system is similar to an iterated extended Kalman filter (IEKF), but the plant noise in our case is multiplicative and it is not straightforward to extend the IEKF to our system. Equation (27) follows from the first-order Taylor series approximation

$$U(\Xi(t+1)) \approx U(\Xi(t)) + L(\Xi(t))\,\Phi(t). \qquad (28)$$

We then use Algorithm 1 to estimate $\Xi(t+1)$ given $\Xi(t)$ and the pixel displacement between $t$ and $t+1$. We have several pixel displacement measurements from multiple cameras, so the estimation equation is highly over-constrained and we can obtain a least squares estimate. We also find that the multiview constraints are able to overcome kinematic singularities and occlusions, which are the bane of monocular pose estimation. We observe that we compute the change of pose $\Phi$ between two frames. The constant-$\Phi$ assumption is perfectly valid when we consider only two frames. The approximation we make is to use the measured pixel displacement as a first-order approximation of the pixel velocity that should be used. We use an iterative estimation algorithm to compensate for this approximation, but we note that a higher frame rate is required for motions faster than those presented in the experiments.

Algorithm 1 Compute 3-D Pose From Pixel Displacement

Require: Pose $\Xi(t)$ at time $t$ and pixel displacement between $t$ and $t+1$

1: set $\hat{\Xi} = \Xi(t)$
2: for (maximum iterations − 1) do
3:   let $\Delta U = U(t+1) - U(\hat{\Xi})$
4:   compute $\hat{\Phi} = \arg\min_{\Phi} \| L(\hat{\Xi})\,\Phi - \Delta U \|^2$
5:   update $\hat{\Xi} \leftarrow \hat{\Xi} \oplus \hat{\Phi}$
6: end for
7: set $\Xi(t+1) = \hat{\Xi}$
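A minimal Python sketch of the loop in Algorithm 1, assuming hypothetical callables U(pose), L(pose), and update(pose, phi) standing in for the stacked projection, its Jacobian (25), and the composition (19)–(20); this is an illustration of the iterative re-linearization, not the paper's implementation.

```python
# Iteratively re-linearize (27)-(28) and solve for the pose velocity.
import numpy as np


def estimate_pose_update(pose, u_observed, U, L, update, n_iter=5):
    """pose:       current full-body pose vector Xi(t)
    u_observed: stacked pixel locations tracked into frame t+1
    U, L:       callables returning projections and their Jacobian at a pose
    update:     callable applying (19)-(20), i.e., pose <- pose (+) phi
    """
    for _ in range(n_iter):
        residual = u_observed - U(pose)               # pixel displacement
        phi, *_ = np.linalg.lstsq(L(pose), residual, rcond=None)
        pose = update(pose, phi)                      # compose via (19)-(20)
    return pose
```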

D. Tracking Pose Using Skeleton Curves

In this section, we describe a key module in the tracking of the pose using 3-D shape cues (skeleton curves). As described earlier and illustrated in Fig. 4, we can segment voxel data [Fig. 4(a)] into different articulated chains and register them to the human body model [Fig. 4(b)]. We can obtain a skeleton curve for each segmented articulated chain, represented by uniformly spaced points on the curve [Fig. 4(c)]. The skeleton model corresponding to the estimated pose is illustrated in Fig. 4(d). In order to determine the pose that best fits the skeleton curve, we first define a distance measure between the skeleton curve [Fig. 4(c)] and the skeleton model [Fig. 4(d)] computed from the pose. We assume that an initial estimate of the pose is available so that we can iteratively refine the estimate to minimize this distance. We note that the skeleton curve for a frame consists of six curves registered to the six articulated chains of the human body model.

Fig. 4. From voxels to the skeleton curve and skeleton model: (a) voxels, (b) segmented voxels, (c) skeleton curve, (d) skeleton model.

Fig. 5. Computing distance between skeleton curve and skeleton model:(a) denotes sample points on the skeleton curve; (b) denotes the distance to theclosest point on the skeleton model after optimization.

We compute the distance by considering each chain independently as described in the following paragraph.

Consider a set of ordered points $x_1, \ldots, x_N$ on a skeleton curve corresponding to the arm (see Fig. 5). The skeleton model for the arm consists of three line segments: $L_1$, $L_2$, and $L_3$. We compute the distance, $d_m$, between $x_m$ and the closest point on its assigned line segment, and assign each point to a line segment. Since the set of points on the skeleton curve is ordered, we impose the constraint that the above assignment is performed in a monotonic manner, i.e., points $x_1, \ldots, x_{m_1}$ are assigned to $L_1$, points $x_{m_1+1}, \ldots, x_{m_2}$ are assigned to $L_2$, and points $x_{m_2+1}, \ldots, x_N$ are assigned to $L_3$. For a given value of $m_1$, $m_2$ is chosen so that the distance between points $x_{m_1+1}$ and $x_{m_2}$ is equal to the length of line segment $L_2$. For the above assignment, the distance between the skeleton curve and the skeleton model is given by the vector $d = (d_1, d_2, \ldots, d_N)^T$. The 3-D pose of the articulated chain as well as the indices $m_1$ and $m_2$ are chosen so as to minimize the sum of the elements of this vector, which is given by

$$e = \sum_{m=1}^{N} d_m. \qquad (29)$$
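The following sketch illustrates the distance of (29) for a three-segment chain. It searches both breakpoints exhaustively, whereas the paper fixes $m_2$ from the arc length of the middle segment; the helper names are ours.

```python
# Curve-to-model distance of Section II-D under a monotonic assignment.
import numpy as np


def point_seg_dist(p, a, b):
    """Distance from point p to the line segment [a, b]."""
    ab, ap = b - a, p - a
    s = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + s * ab))


def chain_distance(points, segs):
    """Minimum of (29) over breakpoints (m1, m2) for ordered curve points
    and segs = [(a1, b1), (a2, b2), (a3, b3)] (the three model segments)."""
    n, best = len(points), np.inf
    for m1 in range(1, n - 1):
        for m2 in range(m1 + 1, n):
            d = sum(point_seg_dist(p, *segs[0]) for p in points[:m1])
            d += sum(point_seg_dist(p, *segs[1]) for p in points[m1:m2])
            d += sum(point_seg_dist(p, *segs[2]) for p in points[m2:])
            best = min(best, d)
    return best
```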

III. ALGORITHM

We present in this section the details of our pose tracking algorithm, including the preprocessing and pose initialization steps. The preprocessing includes using images obtained from multiple cameras to compute silhouettes and voxels and is described in Section III-A. The parameters of the human body model are computed as described in [32].


Fig. 6. Processing images to compute silhouettes and voxels.

We assume that we are able to perform single-frame registration in at least one frame in the sequence and initialize the pose using the method described in [31]. Typically, the pose can be initialized in some of the frames in the sequence, but the single-frame registration is unsuccessful in the majority of the frames and we are left with unregistered skeleton curves. We propose a temporal registration scheme that registers skeleton curves by exploiting their temporal relation. Methods for pose initialization and temporal skeleton curve registration are described in Section III-B. We then describe our tracking algorithm that tracks the pose in two steps, the prediction step using motion cues and the correction step using 2-D and 3-D shape cues, in Section III-C. We also describe an optional smoothing step in Section III-D.

A. Preprocessing

We use images obtained from $N$ calibrated cameras. We perform simple background subtraction to obtain foreground silhouettes as shown in Fig. 6. In order to compute the voxels, we project points on a 3-D grid (in the volume of interest) to all the camera images. All points that are projected to image coordinates that lie inside the silhouette in all the images are considered to be part of the subject. In general, we could consider points that lie inside the silhouette in at least $N - k$ images, where $k$ could take values $0, 1, \ldots$ depending on the number of cameras in use. A nonzero value of $k$ lends robustness to background subtraction errors if there are a large number of cameras. We set $k = 0$ in our experiments. The voxel reconstruction results using the silhouettes in Fig. 6(b) are presented in Fig. 6(c).
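A sketch of this silhouette-intersection test, assuming boolean silhouette images and 3 × 4 projection matrices; this is an illustrative implementation of the $N - k$ rule, not the paper's code.

```python
# Silhouette-based voxel reconstruction over a 3-D grid of sample points.
import numpy as np


def carve_voxels(grid_points, projections, silhouettes, k=0):
    """grid_points: (M, 3) array of 3-D sample points in the volume of interest.
    Returns a boolean mask of points inside at least (N - k) silhouettes."""
    votes = np.zeros(len(grid_points), dtype=int)
    X = np.hstack([grid_points, np.ones((len(grid_points), 1))])  # homogeneous
    for P, sil in zip(projections, silhouettes):
        u = (P @ X.T).T
        px = np.round(u[:, :2] / u[:, 2:3]).astype(int)
        h, w = sil.shape
        inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        idx = np.where(inside)[0]
        votes[idx] += sil[px[idx, 1], px[idx, 0]]   # count silhouette hits
    return votes >= len(projections) - k
```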

B. Pose Initialization and Temporal Registration

We perform segmentation of the voxel data using Laplacian Eigenmaps to obtain the different articulated chains [8], [31]. The method maps voxels on each articulated chain to a smooth 1-D curve in Laplacian Eigenspace. We can then segment voxels as belonging to different curves (or articulated chains) and also register them. For each segmented articulated chain we compute the skeleton curve using smoothing splines as described in [31]. The method to initialize the pose of the subject using the registered skeleton curve is presented in Section III-B1. An example of a successfully segmented and registered frame is presented in Fig. 7. However, the single-frame registration method does not succeed in all frames due to errors in voxel reconstruction or segmentation, examples of which are presented in Figs. 8 and 9.

We present a temporal registration algorithm to register skeleton curves in such frames in Section III-B2.

1) Pose Initialization: The pose is initialized for a completely registered frame as follows. The skeleton curve is sampled at regular intervals to obtain a set of ordered points for each body chain (trunk, head, two arms, and two legs). The sampled skeleton curve is illustrated in Fig. 7(c). We choose an intersample distance of 20 mm as a trade-off between the computational cost of denser sampling and the poor spatial resolution of sparser sampling.
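A sketch of this uniform resampling by linear interpolation along the cumulative arc length (an illustrative helper, not the smoothing-spline sampling of [31]):

```python
# Resample an ordered 3-D skeleton curve at a fixed intersample distance.
import numpy as np


def resample_curve(points, spacing=20.0):
    """points: (N, 3) ordered points on a skeleton curve (in mm).
    Returns points spaced `spacing` mm apart along the curve."""
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])     # cumulative arc length
    targets = np.arange(0.0, arc[-1], spacing)
    return np.column_stack([
        np.interp(targets, arc, points[:, i]) for i in range(3)
    ])
```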

The pose is computed using the skeleton curves and is initialized in two steps. First, the pose of the trunk is determined and, second, the pose of the remaining five articulated chains is computed. The $z$-axis of the trunk is aligned with the skeleton curve of the trunk as marked in Fig. 7(d). The $x$-axis of the trunk is parallel to the right-left vector, which is set to be the average of the vectors from the right to the left shoulder joint and from the right to the left pelvic joint on the skeleton curve marked in Fig. 7(d). The $y$-axis points in the forward direction, which is determined using the direction of the feet and is orthogonal to the computed $zx$ plane. The axis orientation that describes the pose of the trunk is illustrated in Fig. 7(e). Once the trunk pose has been estimated, the joint locations at the hips, shoulders, and neck are fixed. It is then possible to estimate the pose of each of the articulated chains independently. The objective is to compute the pose of the skeleton model so that the distance between the points on the skeleton curve and the skeleton model is minimized, as described in Section II-D. The initial estimate of the pose is illustrated in Fig. 7(f).

2) Temporal Skeleton Curve Registration: Two examples where registration of skeleton curves to articulated chains in a single frame fails are illustrated in Figs. 8 and 9. In one of the examples, the head is missing due to errors in background subtraction. In the other, seven skeleton curves are discovered instead of six. We, therefore, introduce a temporal registration scheme which exploits the proximity of the skeleton curves belonging to the same body segment in temporally adjacent frames. Given two frames at $t_1$ and $t_2$ where single-frame registration is successful, we perform temporal registration in all the frames between $t_1$ and $t_2$. Let $\{x_1, \ldots, x_{N_1}\}$ and $\{y_1, \ldots, y_{N_2}\}$ be the sets of points belonging to skeleton curves $C_1$ and $C_2$, respectively. The distance between skeleton curves $C_1$ and $C_2$ is given by

$$d(C_1, C_2) = \frac{1}{N_1} \sum_{i=1}^{N_1} \min_{j} \| x_i - y_j \|. \qquad (30)$$

Let $C_i(t)$ and $R_j(t)$ represent the unregistered and registered skeleton curves for the $j$th articulated chain at time instant $t$. The temporal skeleton curve registration algorithm is listed in Algorithm 2. We typically set the threshold $d_{\min}$ to a fixed value in mm; this threshold is based on the maximum distance a chain can move between two frames and the intra-curve distance within a given frame.

Fig. 7. Example of registered frame: the various stages from segmentation and registration to pose initialization.

Fig. 8. Unregistered frame with missing head.

Fig. 9. Unregistered frame with extra segment.

Algorithm 2 Temporal Skeleton Curve Registration

Require: Registered skeleton curves at time $t_1$: $R_j(t_1)$ for $j = 1, \ldots, 6$.

Require: Unregistered skeleton curves: $C_i(t)$ for $t = t_1+1, \ldots, t_2-1$.

1: for time $t = t_1+1$ to $t_2-1$ do
2:   for each $C_i(t)$ do
3:     find the closest curve, $R_j(t-1)$, if it exists, such that $d(C_i(t), R_j(t-1)) < d_{\min}$
4:   end for
5:   if each curve $C_i$ is mapped to a different $R_j$ then
6:     registration is successful.
7:   end if
8:   for each chain $j$ do
9:     if $R_j$ has a registered candidate $C_i$ then
10:      set $R_j(t) = C_i(t)$
11:    else
12:      set $R_j(t) = \emptyset$.
13:    end if
14:  end for
15: end for

We use Algorithm 2 to perform reverse temporal registration as well, i.e., we start at $t_2$ and proceed backwards in time to $t_1$. The reverse registration is necessary because, if there is a gap in the forward temporal registration for any of the six articulated chains, then that chain is unlikely to be registered from that point onwards. Any skeleton curve that is not registered to the same articulated chain in the forward and reverse temporal registration processes is not used in the tracking.
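The following sketch shows one forward pass of this registration, with the curve distance of (30) implemented as a mean nearest-point distance (our assumption about the exact form):

```python
# One forward pass of the temporal registration of Algorithm 2.
import numpy as np


def curve_distance(X, Y):
    """Mean distance from each point of curve X to the closest point of Y."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return d.min(axis=1).mean()


def register_frame(unregistered, registered_prev, d_min):
    """unregistered:    list of (N_i, 3) curves in the current frame
    registered_prev: dict chain_id -> (M_j, 3) curve from the previous frame
    Returns dict chain_id -> curve, or None if the mapping is ambiguous."""
    mapping = {}
    for curve in unregistered:
        dists = {j: curve_distance(curve, Y) for j, Y in registered_prev.items()}
        j_best = min(dists, key=dists.get)
        if dists[j_best] < d_min:
            if j_best in mapping:        # two curves map to the same chain
                return None
            mapping[j_best] = curve
    return mapping
```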

C. Pose Tracking

Our tracking algorithm consists of a prediction step and a correction step that are described in Sections III-C1 and III-C2, respectively. The overview of the tracking algorithm is presented in Algorithm 3.

Algorithm 3 Tracking Algorithm

1: for time $t = t_1$ to $t_2 - 1$ do
2:   /* predict pose at time $t+1$ */
3:   compute 2-D pixel displacement between frames $t$ and $t+1$
4:   estimate 3-D pose using pixel displacement
5:   /* correct pose at time $t+1$ */
6:   for chain $j = 1, \ldots, 6$ do
7:     if 3-D shape cues are available for chain $j$ then
8:       correct pose using 3-D shape cues (skeleton curves).
9:     else
10:      correct pose using 2-D shape cues (silhouettes and motion residues).
11:    end if
12:  end for
13: end for

1) Prediction Using Motion Cues: In order to estimate the motion of each of the body segments, we first project the body segment onto each image. We call this step pixel-body registration. We then compute the pixel displacement for each body segment in each image using the motion model for a rigid segment. The pixel displacement for the set of bodies in all the images is then stacked in a single matrix equation which we use to estimate the change in 3-D pose.


Fig. 10. Pixel registration: (a) view 1, (b) view 2.

Fig. 11. (a) Smoothed image with the foreground mask. (b) Motion residue for $\theta = 0$. (c) The estimated pixel displacement for the mask. (d) Motion residue for the estimated $\theta$ that results in the pixel displacement in (c).

Fig. 12. We estimate the motion from the base of the kinematic chain, i.e., the trunk, and propagate the motion in further steps. The segments for which the pose is computed at a given stage are colored in black. (a) Step 1. (b) Step 2. (c) Step 3.

Fig. 13. Obtaining the unified error image for the forearm. (a) The silhouette at time $t+1$. (b) A magnified view of the silhouette. (c) The motion residue at time $t$. (d) The combined error image. (e) Error image with the mask corresponding to the segment whose pose we are trying to correct.

a) Pixel-Body Registration: In order to determine the correspondence between each pixel and the body segments, we represent each body segment as a triangular mesh and project it onto each image. The depth at each pixel is determined by interpolating the depths of the triangle vertices.

Fig. 14. Minimum error configuration: it does not matter if the object is occluded or nearly occluded in some of the images. (a) Cam. 1, (b) Cam. 2, (c) Cam. 3, (d) Cam. 4, (e) Cam. 6.

Fig. 15. Position of the subject in the world coordinate frame in the three sequences.

When multiple segments are projected onto the same pixel, the registration ambiguity is resolved by using the depth. We thus register each pixel to the body segment it belongs to and determine its 3-D coordinates. Fig. 10 illustrates the projection of the body segments onto images from two cameras. Different colors denote different body segments.
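A per-vertex sketch of this depth-resolved registration; projecting every mesh vertex with its depth approximates the triangle rasterization described above (illustrative names and structure):

```python
# Depth-buffered pixel-body registration: nearest segment wins at each pixel.
import numpy as np


def register_pixels(vertices_per_segment, P, image_shape):
    """vertices_per_segment: list of (N_i, 4) homogeneous vertex arrays.
    Returns label and depth images (label -1 where no segment projects)."""
    h, w = image_shape
    depth = np.full((h, w), np.inf)
    label = np.full((h, w), -1, dtype=int)
    for seg_id, V in enumerate(vertices_per_segment):
        u = (P @ V.T).T
        z = u[:, 2]
        px = np.round(u[:, :2] / z[:, None]).astype(int)
        ok = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        for (x, y), zi in zip(px[ok], z[ok]):
            if zi < depth[y, x]:          # resolve occlusion by depth
                depth[y, x] = zi
                label[y, x] = seg_id
    return label, depth
```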

b) Estimating Pixel Displacement: We use a parametric model for the motion of pixels belonging to the same segment in each image. The displacement at pixel $u$ is a function of $\theta = (d, \phi, s)$, where $d$ is the displacement, $\phi$ is the rotation, and $s$ is the scale parameter, and is given by

$$f(u; \theta) = u_0 + d + (1+s)\,R(\phi)\,(u - u_0) \qquad (31)$$

where $u_0$ is the 2-D location of the joint of the body segment. We find that the above parametric representation is more robust than an affine motion model and we can also set meaningful upper and lower bounds on each parameter. Let $u_1, \ldots, u_K$ be the pixels registered to a given segment. We compute the value of $\theta$ for each segment that minimizes the residue $\|r(\theta)\|$ subject to the imposed bounds on the motion, where the $k$th element of $r$ (the vector of pixel residues) is given as

$$r_k(\theta) = I_{t+1}(f(u_k; \theta)) - I_t(u_k) \qquad (32)$$

where $I_t$ is the observed image at time $t$ in one of the cameras. A value of $\theta = 0$ means there is no motion.
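A sketch of the parametric motion (31) and the residue (32); the exact parameterization (scale as $1+s$ so that $\theta = 0$ is the identity warp) is an assumption consistent with the text, and the helper names are ours.

```python
# Parametric pixel motion (31) and per-pixel residue (32) for one segment.
import numpy as np
from scipy.ndimage import map_coordinates


def warp(points, theta, u0):
    """Apply displacement d, rotation phi, and scale (1+s) about the joint u0."""
    d, phi, s = theta[:2], theta[2], theta[3]
    R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
    return u0 + d + (1.0 + s) * (points - u0) @ R.T


def motion_residue(theta, points, u0, img_t, img_t1):
    """Residue (32): I_{t+1} sampled at the warped locations minus I_t."""
    warped = warp(points, theta, u0)
    vals = map_coordinates(img_t1, [warped[:, 1], warped[:, 0]], order=1)
    return vals - img_t[points[:, 1].astype(int), points[:, 0].astype(int)]
```

This residue vector can be minimized over the four bounded parameters with any box-constrained least-squares routine.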

Fig. 11 illustrates the pixel displacement computations. Once we have estimated the motion parameters, we can compute the corresponding "motion residue," which is defined as the difference between $I_t$ and $I_{t+1}$ warped according to the estimated motion.


Fig. 16. Raw and smooth translational components represented by dots and lines, respectively. (a) $t_x$ ($x$ component), (b) $t_y$ ($y$ component), (c) $t_z$ ($z$ component).

If the actual motion of a pixel agrees with the estimated motion, then the motion residue for that pixel is zero; otherwise it is nonzero. We note that the $\theta = 0$ motion [Fig. 11(b)] agrees with the motion of the stationary background pixels. However, it does not agree with the motion of the foreground pixels. Fig. 11(c) denotes the estimated pixel displacement for the body segment under consideration. Fig. 11(d) is the motion residue for the estimated $\theta$. We note that the estimated motion agrees with the actual motion for the foreground pixels (in the mask) but not for the background pixels, i.e., the motion residue for the pixels in the mask is almost zero. Thus, the motion residue provides us with a rough delineation of the location of the body segment, even when the original mask does not exactly match the body segment.

c) Pose Prediction: We predict the pose at time $t+1$ given the pose at time $t$ and the pixel displacement computed above, as described in Algorithm 1 in Section II-C. While our body model and our pose estimation algorithms allow rotation and translation at each joint, we set the translational component to zero at most joints, as we find that in practice the estimation is more robust when the number of translational parameters is small [37]. We estimate the pose of the subject in multiple steps, starting at the root of the kinematic chain, as illustrated in Fig. 12. In the first step, we estimate the pose for the segments belonging to the trunk; in the second, we include the first segment of each of the articulated chains; and in the final step we estimate the pose for all the segments except the trunk. The translational component of all segments is set to zero except for the base body and the shoulder joints. The base body is allowed to translate freely, and the translation at the shoulder joint, $t_s$, is constrained so that its magnitude does not exceed a fixed bound in mm.

2) Correction Using 3-D and 2-D Shape Cues: The pose can be corrected for all the articulated chains in a given frame that have been registered using 3-D shape cues (skeleton curves). The pose parameter search space is centered and bounded around the pose predicted using motion cues. In the absence of 3-D shape cues, we use 2-D shape cues in the form of silhouettes and motion residues. Thus, the algorithm adapts itself to use the available spatial cues.

We have observed earlier that the motion residue for a given segment provides us with a region that helps us to spatially delineate the segment. We now combine it with the silhouette as illustrated in Fig. 13 to form an error image for that segment. The error image is the sum of the silhouette and the motion residue and is computed for each camera and each segment, along with a mask for the body segment, as illustrated in Fig. 13(e). This error image can be used to compute an objective function in terms of the 3-D pose of the segment. Given any 3-D pose of the segment, we can project the segment onto each image to obtain a mask for the segment [Fig. 13(e)]. The objective function is computed by summing the pixels of the error image that lie in the mask. Our estimate of the 3-D pose is the value that minimizes this objective function over all the images. The objective function is optimized in a pose parameter space that is centered around the predicted pose using the lsqnonlin function in the Matlab nonlinear optimization toolbox. We illustrate the results of pose correction for the above example in Fig. 14. The red line represents the initial position of the axis of the body segment and the cyan line represents the final position. The final mask location is denoted in blue, and we note that it is well aligned with the silhouette.
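A sketch of this correction step, with scipy's derivative-free Powell minimizer standing in for Matlab's lsqnonlin (the objective is non-smooth because of the boolean mask) and a hypothetical project_segment_mask helper for the mesh projection of Section III-C1a; the bound on the search box is illustrative.

```python
# Shape-cue correction: minimize the error-image sum under the projected mask.
import numpy as np
from scipy.optimize import minimize


def correction_objective(pose_params, error_images, project_segment_mask):
    """Total error over all cameras for a candidate segment pose."""
    total = 0.0
    for cam, err in enumerate(error_images):
        mask = project_segment_mask(pose_params, cam)   # boolean image
        total += err[mask].sum()
    return total


def correct_pose(pred_pose, error_images, project_segment_mask, bound=0.2):
    """Search a box centered on the pose predicted from motion cues."""
    bounds = [(p - bound, p + bound) for p in pred_pose]
    res = minimize(correction_objective, pred_pose,
                   args=(error_images, project_segment_mask),
                   method="Powell", bounds=bounds)
    return res.x
```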

D. Smoothing

It is often beneficial to perform temporal smoothing on the pose vector, as it typically improves the performance of the algorithm. We propose an optional smoothing step that acts on the pose of the root segment of the kinematic chain. It is difficult to smooth the entire pose vector due to the articulated constraints between the segments, and we therefore restrict the smoothing to the pose of the trunk segment (root), as it has an impact on the pose of all the body segments. We smooth the pose estimated from the skeleton curves using the smoothing spline function csaps in the Matlab Spline Toolbox. The trunk location is interpolated for frames missing the trunk skeleton curve. The translational components of the pose of the trunk for one of the test sequences are presented in Fig. 16; they are given by the translation $t = (t_x, t_y, t_z)^T$ of the trunk pose.
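A sketch of this smoothing, with scipy's UnivariateSpline standing in for csaps; the smoothing factor s is illustrative.

```python
# Smoothing-spline smoothing of the trunk translation, one spline per axis.
import numpy as np
from scipy.interpolate import UnivariateSpline


def smooth_trunk_translation(times, translations, s=1.0):
    """times: (T,) strictly increasing frame times; translations: (T, 3)
    raw trunk positions. Frames with a missing trunk skeleton curve can be
    left out of `times`; the spline then interpolates them."""
    return np.column_stack([
        UnivariateSpline(times, translations[:, i], s=s)(times)
        for i in range(3)
    ])
```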

IV. EXPERIMENTAL RESULTS

We performed tracking on sequences where the subject performs different kinds of motion. The experiments were performed using gray-scale images obtained from eight cameras with a spatial resolution of 648 × 484 at a frame rate of 30 frames per second. The external and internal camera calibration parameters for all the cameras were obtained using the camera calibration algorithm of Svoboda [38] and a simple calibration device to compute the scale and the world coordinate frame. We present results for three sequences in Figs. 17–19.


Fig. 17. Tracking results for sequence 1. (a) Images from camera 1, (b) images from camera 3.

Fig. 18. Tracking results for sequence 2. (a) Images from camera 1, (b) images from camera 3.

Fig. 19. Tracking results for sequence 3. (a) Images from camera 1, (b) images from camera 3.

The sequences include the subject walking in a straight line (65 frames, 2 s) in Fig. 17, swinging the arms in wide arcs (300 frames, 10 s) in Fig. 18, and walking in a circular path (300 frames, 10 s) in Fig. 19. Fig. 15 illustrates the motion of the base body in the world coordinate frame in the three sequences. Our experiments show that using only motion cues for tracking causes the pose estimator to drift and eventually lose track, as we are estimating only the difference in the pose. This underlines the need for "correcting" the predicted pose using spatial cues, and we observe that the "correction" step of the algorithm prevents drift in the tracking. We illustrate the results of the tracking algorithm by superimposing the tracked body model onto the image for two of the eight cameras. The estimated pose of the subject is superimposed on the images and the success of the tracking algorithm is determined by visual inspection. It is not possible to obtain an objective measure of the pose error as the actual pose is not available. The full body pose is successfully tracked in the three sequences, as can be observed in the superimposed video sequences. Selected frames from different cameras are presented in Figs. 17–19.

V. CONCLUSION

We presented a complete pose initialization and tracking algorithm using a flexible and full human body model that allows translation at complex joints such as the shoulder joint. The human body model is automatically estimated from the sequence using the algorithm presented in [8], [32]. Pose initialization is performed based on single-frame segmentation and registration [8], [31] of voxel data. An algorithm to perform temporal registration of partially segmented voxels for tracking was also proposed. We used both motion cues and shape cues, such as skeleton curves obtained from bottom-up voxel segmentation as well as silhouettes and "motion residues," to perform the tracking. We presented results on sequences with different kinds of motion and observe that the several independent cues used in the tracker enable it to perform in a robust manner. The complete motion capture system has been written in Matlab.


We note that currently the computational requirements of the system are high, primarily due to the number of cameras used in the processing and the inefficiency of the Matlab platform. The tracking process takes approximately 10–15 s per frame on a Pentium Xeon 2-GHz processor. Some of the most computationally intensive modules, such as the projection of the human body model onto each of the images, can be optimized and parallelized, and it is possible to optimize the system to operate at speeds of 1 frame per second or better. We anticipate that our motion capture system can be optimized and polished for use in a variety of important applications in biomechanical and clinical analysis, human computer interaction, and animation.

REFERENCES

[1] D. Ramanan and D. A. Forsyth, “Finding and tracking people from the bottom up,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Madison, WI, Jun. 2003, vol. 2, pp. 467–474.

[2] S. Wachter and H.-H. Nagel, “Tracking persons in monocular image sequences,” Comput. Vis. Image Understand., vol. 74, no. 3, pp. 174–192, Jun. 1999.

[3] H. Sidenbladh, M. J. Black, and D. J. Fleet, “Stochastic tracking of 3D human figures using 2-D image motion,” in Proc. Eur. Conf. Computer Vision, Dublin, Ireland, Jun. 2000, vol. 2, pp. 702–718.

[4] S. X. Ju, M. J. Black, and Y. Yacoob, “Cardboard people: A parameterized model of articulated image motion,” in Proc. Int. Conf. Automatic Face and Gesture Recognition, Killington, VT, Oct. 1996, pp. 38–44.

[5] C.-W. Chu, O. C. Jenkins, and M. J. Mataric, “Markerless kinematic model and motion capture from volume sequences,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Madison, WI, Jun. 2003, vol. 2, pp. 475–482.

[6] I. Mikic, M. Trivedi, E. Hunter, and P. Cosman, “Human body model acquisition and tracking using voxel data,” Int. J. Comput. Vis., vol. 53, no. 3, 2003.

[7] L. Mündermann, S. Corazza, and T. Andriacchi, “Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models,” presented at the IEEE Conf. Computer Vision and Pattern Recognition, Minneapolis, MN, Jun. 2007.

[8] A. Sundaresan and R. Chellappa, “Model driven segmentation and registration of articulating humans in Laplacian Eigenspace,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1771–1785, Oct. 2008.

[9] K. Cheung, S. Baker, and T. Kanade, “Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Madison, WI, Jun. 2003, vol. 1, pp. 77–84.

[10] D. M. Gavrila, “The visual analysis of human movement: A survey,” Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.

[11] T. Moeslund and E. Granum, “A survey of computer vision-based human motion capture,” Comput. Vis. Image Understand., vol. 81, pp. 231–268, 2001.

[12] L. Sigal and M. Black, HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion, Brown Univ., Providence, RI, Tech. Rep. CS-06-08, 2006.

[13] N. I. Badler, C. B. Phillips, and B. L. Webber, Simulating Humans. Oxford, U.K.: Oxford Univ. Press, 1993.

[14] D. Morris and J. M. Rehg, “Singularity analysis for articulated object tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, Jun. 1998, pp. 289–297.

[15] J. Aggarwal and Q. Cai, “Human motion analysis: A review,” Comput. Vis. Image Understand., vol. 73, no. 3, pp. 428–440, 1999.

[16] R. Plänkers and P. Fua, “Articulated soft objects for video-based body modeling,” in Proc. Int. Conf. Computer Vision, Vancouver, BC, Canada, Jul. 2001, vol. 1, pp. 394–401.

[17] D. Demirdjian, T. Ko, and T. Darrell, “Constraining human body tracking,” in Proc. Int. Conf. Computer Vision, Nice, France, Oct. 2003, vol. 2, pp. 1071–1078.

[18] N. Krahnstoever and R. Sharma, “Articulated models from video,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun. 2004, vol. 1, pp. 894–901.

[19] X. Lan and D. P. Huttenlocher, “A unified spatio-temporal articulated model for tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun. 2004, vol. 1, pp. 722–729.

[20] C. Sminchisescu and B. Triggs, “Covariance scaled sampling for monocular 3D body tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Kauai, HI, Dec. 2001, vol. 1, pp. 447–454.

[21] C. Sminchisescu and B. Triggs, “Kinematic jump processes for monocular 3D human tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Madison, WI, Jun. 2003, vol. 1, pp. 69–76.

[22] I. A. Kakadiaris and D. Metaxas, “Model-based estimation of 3D human motion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1453–1459, Dec. 2000.

[23] Q. Delamarre and O. Faugeras, “3D articulated models and multi-view tracking with silhouettes,” in Proc. Int. Conf. Computer Vision, Kerkyra, Corfu, Greece, Sep. 1999, vol. 2, pp. 716–721.

[24] T. Moeslund and E. Granum, “Multiple cues used in model-based human motion capture,” in Proc. Int. Conf. Face and Gesture Recognition, Grenoble, France, Mar. 2000, pp. 362–367.

[25] M. Yamamoto and K. Koshikawa, “Human motion analysis based on a robot arm model,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Maui, HI, Jun. 1991, pp. 664–665.

[26] M. Yamamoto, A. Sato, S. Kawada, T. Kondo, and Y. Osaki, “Incremental tracking of human actions from multiple views,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, Jun. 1998, pp. 2–7.

[27] C. Bregler and J. Malik, “Tracking people with twists and exponential maps,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, Jun. 1998, pp. 8–15.

[28] L. Sigal, M. Isard, B. H. Sigelman, and M. J. Black, “Attractive people: Assembling loose-limbed models using non-parametric belief propagation,” in Proc. Conf. Neural Information Processing Systems, Vancouver, BC, Canada, 2003, pp. 1539–1546.

[29] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard, “Tracking loose-limbed people,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, Jun. 2004, vol. 1, pp. 421–428.

[30] D. Gavrila and L. Davis, “3-D model-based tracking of humans in action: A multi-view approach,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1996, pp. 73–80.

[31] A. Sundaresan and R. Chellappa, “Segmentation and probabilistic registration of articulated body model,” in Proc. Int. Conf. Pattern Recognition, Hong Kong, Aug. 2006, vol. 2, pp. 92–96.

[32] A. Sundaresan and R. Chellappa, “Acquisition of articulated human body models using multiple cameras,” in Proc. Conf. Articulated Motion and Deformable Objects, Port d’Andratx, Mallorca, Spain, Jul. 2006, pp. 78–89.

[33] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.

[34] L. Mündermann, S. Corazza, A. M. Chaudhari, E. J. Alexander, and T. P. Andriacchi, “Most favorable camera configuration for a shape-from-silhouette markerless motion capture system for biomechanical analysis,” in Proc. SPIE Videometrics, Jan. 2005, vol. 5665.

[35] R. M. Murray, Z. Li, and S. S. Sastry, A Mathematical Introduction to Robotic Manipulation. Boca Raton, FL: CRC, 1994.

[36] A. Sundaresan and R. Chellappa, “Multi-camera tracking of articulated human motion using motion and shape,” in Proc. Asian Conf. Computer Vision, Hyderabad, India, Jan. 2006, vol. 2, pp. 131–140.

[37] A. Sundaresan, A. Roy-Chowdhury, and R. Chellappa, “Multiple view tracking of human motion modelled by kinematic chains,” in Proc. IEEE Int. Conf. Image Processing, Barcelona, Spain, Sep. 2004, vol. 2, pp. 93–96.

[38] T. Svoboda, D. Martinec, and T. Pajdla, “A convenient multi-camera self-calibration for virtual environments,” PRESENCE: Teleoperators and Virtual Environments, vol. 14, no. 4, Aug. 2005.

Aravind Sundaresan received the B.E. (Hons.) degree from the Birla Institute of Technology and Science, Pilani, India, in 2000, and the M.S. and Ph.D. degrees from the Department of Electrical and Computer Engineering, University of Maryland, College Park, in 2005 and 2007, respectively.

He is currently a computer scientist in the Artificial Intelligence Center at SRI International, Menlo Park, CA. His research interests are in pattern recognition, image processing, and computer vision, and, in particular, their application to markerless motion capture, real-time pose tracking in crowds, and 3-D scene modeling and navigation for mobile robots.


Rama Chellappa (S’78–M’79–SM’83–F’92) received the B.E. (Hons.) degree from the University of Madras, India, in 1975, and the M.E. (Distinction) degree from the Indian Institute of Science, Bangalore, in 1977. He received the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1978 and 1981, respectively.

Since 1991, he has been a Professor of electrical engineering and an affiliate Professor of computer science at the University of Maryland, College Park.

He is also affiliated with the Center for Automation Research (Director) and the Institute for Advanced Computer Studies (Permanent Member). In 2005, he was named a Minta Martin Professor of Engineering. Prior to joining the University of Maryland, he was an Assistant Professor (1981–1986), Associate Professor (1986–1991), and Director of the Signal and Image Processing Institute (1988–1990) at the University of Southern California, Los Angeles. Over the last 28 years, he has published numerous book chapters and peer-reviewed journal and conference papers. He has coauthored and edited many books on visual surveillance, biometrics, MRFs, and image processing. His current research interests are face and gait analysis, 3-D modeling from video, image- and video-based recognition and exploitation, and hyperspectral processing.

Prof. Chellappa has served as an Associate Editor of many IEEE Transactions, as a Co-Editor-in-Chief of Graphical Models and Image Processing, and as the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. He served as a member of the IEEE Signal Processing Society Board of Governors and as its Vice President of Awards and Membership. He is serving a two-year term as the President of the IEEE Biometrics Council. He has received several awards, including an NSF Presidential Young Investigator Award, four IBM Faculty Development Awards, an Excellence in Teaching Award from the School of Engineering at USC, two paper awards from the International Association for Pattern Recognition, and the Technical Achievement and Meritorious Service Awards from the IEEE Signal Processing Society and the IEEE Computer Society. At the University of Maryland, he was elected a Distinguished Faculty Research Fellow and a Distinguished Scholar-Teacher; he received the Outstanding Faculty Research Award from the College of Engineering and an Outstanding Innovator Award from the Office of Technology Commercialization. He is a Fellow of the International Association for Pattern Recognition and the Optical Society of America. He has served as a General and Technical Program Chair for several IEEE international and national conferences and workshops. He is a Golden Core Member of the IEEE Computer Society and is serving a two-year term as a Distinguished Lecturer of the IEEE Signal Processing Society.