
J Sign Process Syst (2009) 57:285–295
DOI 10.1007/s11265-008-0250-2

Real-time Visual Tracker by Stream Processing
Simultaneous and Fast 3D Tracking of Multiple Faces in Video Sequences by Using a Particle Filter

Oscar Mateo Lozano · Kazuhiro Otsuka

Received: 12 September 2007 / Revised: 25 April 2008 / Accepted: 12 June 2008 / Published online: 12 July 2008
© The Author(s) 2008

Abstract In this work, we implement a real-time visual tracker that targets the position and 3D pose of objects in video sequences, specifically faces. The use of stream processors for the computations and efficient sparse-template-based particle filtering allows us to achieve real-time processing even when tracking multiple objects simultaneously in high-resolution video frames. Stream processing is a relatively new computing paradigm that permits the expression and execution of data-parallel algorithms with great efficiency and minimum effort. Using a GPU (graphics processing unit, a consumer-grade stream processor) and the NVIDIA CUDA™ technology, we can achieve performance improvements as large as ten times compared to a similar CPU-only tracker. At the same time, the stream processing approach opens the door to other computing devices, like the Cell/BE™ or other multicore CPUs.

Keywords Stream processing · GPGPU · CMP ·Particle filtering · Video tracking · Real-time systems

O. Mateo Lozano · K. Otsuka (B)
NTT Communication Science Laboratories,
3-1, Morinosato-Wakamiya, Atsugi-shi 243-0198, Japan
e-mail: [email protected]

O. Mateo Lozano
e-mail: [email protected]

O. Mateo Lozano
Image Processing Group, Universidad Politécnica de Madrid, Madrid, Spain

1 Introduction

Fast and robust object tracking in video sequences is required by many applications in many fields: automated surveillance systems follow their targets using these techniques, robots rely on them to perform navigation tasks or man-machine interaction, augmented reality systems depend on the position data acquired by visual tracking to place their virtual objects in the real world, and videogames or assistive devices can be controlled thanks to a camera and some face or hand tracking software, to name just a few.

Our motivation for developing a real-time face video tracker is to advance research on a system [1] that can infer conversation structure from video sequences of face-to-face communication, in which it is assumed that the gaze directions of the participants, which provide cues for discerning the conversation structure, can be identified from head directions. This system could be applied to a wide range of video-based applications such as computer-mediated communication. The constraints we impose on this tracker are: it has to be completely automatic, robust against rapid movement and partial occlusion, work with just one camera (no stereo vision) on a conventional PC, and be able to track several faces simultaneously in real time.

1.1 Particle Filtering

Particle filtering [2] is a model estimation technique based on Monte Carlo simulations. Random values of a state-space variable are generated (the so-called particles), used in a description of the system, and checked against the current output measurement to generate a weight value, or likelihood of that particle being the one that best describes the current state of the system. Therefore, the collection of all these particles and their weights at each instant is a numerical approximation of the probability density function of the system. The particle filter (PF) framework is the basis of the well-known condensation (conditional density propagation [3]) algorithm, which was originally proposed for contour tracking but has also been successfully applied to the appearance-based tracking of moving objects in video sequences. The probabilistic approach of these methods provides significant robustness, as several possible states of the system are tracked at any given moment.

A common problem with this kind of practical method is its significant computational requirements: when the number of particles becomes very large, the Monte Carlo characterization becomes a better representation of our process, but the cost of the algorithmic calculations also increases. Fortunately, particle filters are easy to parallelize; they require high arithmetic throughput (as opposed to low latency), and have low global communication and storage costs. It is our belief that the advent of consumer-grade parallel processors can bring the robustness of these algorithms to real-time applications.

1.2 Stream Processing and GPGPU

We are especially interested in computer graphics chips (known as "graphics processing units" or GPUs), because they are currently the most powerful and cheapest computational hardware available [4]. These chips have gone from fixed-application peripherals to modern, powerful, and programmable general-purpose processors. In recent years, there has been strong interest from researchers and developers in exploiting the power of commodity graphics hardware for general-purpose computing (this movement is known as GPGPU, for "general-purpose GPU"). Unfortunately, the GPU uses an unusual programming model, so effective GPU programming requires rewriting the target algorithm into graphics terms; the programmer must be familiar with the limitations of the underlying hardware.

We also note that other multicore processors (or CMPs, for "chip multiprocessors") suffer, like GPUs, from programming difficulties. The traditional serial programming approach does not take advantage of the additional cores in the processor. The typical thread programming method (which raises the issues of synchronization, deadlock, load balancing, and so on) is very hard to implement and does not scale well given the number of cores in current and future multicore CPUs. The Cell Broadband Engine processor allows many kinds of parallel processing techniques (such as DMA operations), and program scheduling in the "synergistic processing elements" can be done by hand by the programmer. The complexity of this, however, makes the task non-trivial.

The programming paradigm raised by stream computing can be described as follows: given a set of input and output data (streams), define some computation-intensive operations (kernel functions) that are to be applied to each element in the stream, while exploiting the data independence and locality typical of media-processing applications. The programmer is forced by this (intuitive) programming model to express his/her application in a way that adapts well to the computational resources of CMPs. Many believe that stream processing represents an important advance in making parallel processing easily accessible to programmers: NVIDIA Corp. now provides a full architecture based on the stream processing model called CUDA™ (Compute Unified Device Architecture [5]); AMD, Inc. has recently released a stream processor solution based on both hardware and a low-level SDK called CTM™ (Close To Metal) [6]; and RapidMind [7] is a software solution that can realize stream processing on top of the most common CMP architectures.
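
As a minimal illustration of this model (a hypothetical example, not code from the tracker), a CUDA kernel that applies one operation independently to every element of an input stream looks like this:

// One kernel function applied to each element of an input stream.
__global__ void scaleStream(const float *in, float *out, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per stream element
    if (i < n)
        out[i] = gain * in[i];
}

// Host side: launch enough 256-thread blocks to cover the whole stream.
// scaleStream<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);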

Stream processing offers more than just ease of programming: architectures that map well to this paradigm ("stream processors") can achieve higher performance than other architectures [8], as the locality and concurrency enforced by this paradigm (and the associated data bandwidth hierarchy) allow more of the die to be devoted to ALUs instead of caching and memory access circuitry. GPUs (especially since the advent of unified architectures that exchange the vertex and pixel shaders found in last-generation devices for generic SIMD processors, albeit with some limitations [9]) or the Cell BE (with appropriate software support) can be considered general-purpose stream processors. Universities like Stanford [10] and companies like SPI [11] are working on pure stream processor implementations.

It is our belief that the particle filter framework applies especially well to the stream processing paradigm: the operations to be performed are very simple (in our case, they consist mainly of geometric transformations), and the data to be used are highly localized (each particle is self-contained) and need little memory, so we can use the ability offered by stream processors (performing many computations extremely rapidly and in parallel) to achieve our real-time system. Computer vision, considered the inverse of computer graphics, has traditionally been very well suited for GPGPU mapping, as some authors have pointed out [12]; some have even studied the application of pure GPGPU techniques to other particle filtering tracking algorithms with great success (like the 2D visual tracker in [13, 14]). Stream processing simplifies the scenario for every GPGPU project, but computer vision is still a clear winner due to the fact that some GPU functions (like projective texture lookups [11]) remain in next-generation GPUs [5] and have specific silicon devoted to them. To the best of our knowledge, this is the first 3D object tracker to be based on stream processing.

2 Method Overview

2.1 Sparse Template Condensation Tracking

To define the problem of tracking, consider the state sequence {x_t, t ∈ N} of a target, composed in our case of:

x_t = (T_x, T_y, S, R_x, R_y, R_z, \alpha)    (1)

states (we omit for simplicity some instances of subscript t), where T_x, T_y are the translation coordinates of the object under study, S is the scale, R_x, R_y, R_z are the rotations along each axis, and α is a global illumination variable. When the target to track is a human face, x_t represents the location and pose of that face (Fig. 1).
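
For concreteness, the seven-dimensional state of Eq. (1) and a particle carrying it could be represented as plain structs like the following; the field names are ours and are only meant to make the layout explicit.

// Illustrative layout of the state vector of Eq. (1) and of one particle.
struct State {
    float tx, ty;       // translation T_x, T_y on the image plane
    float s;            // scale S
    float rx, ry, rz;   // rotations R_x, R_y, R_z (radians)
    float alpha;        // global illumination factor
};

struct Particle {
    State x;            // hypothesized state  x~_t^j
    float weight;       // importance weight   pi_t^j
};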

This state-space vector is given in each discrete time step t > 0 by:

x_t = f_t(x_{t-1}, v_{t-1})    (2)

where f_t : R^{n_x} \times R^{n_v} \to R^{n_x} is a possibly non-linear function of state x_{t-1}, {v_{t-1}, t ∈ N} is an independent and identically distributed process noise sequence, and n_x, n_v are the dimensions of the state (7 in our case) and process noise vectors, respectively. Equation 2 is often called the dynamics model or state transition model. The objective of tracking is to recursively estimate x_t from measurements

z_t = h_t(x_t, n_t)    (3)

where h_t : R^{n_x} \times R^{n_n} \to R^{n_z} is a possibly non-linear function, {n_t, t ∈ N} is an independent and identically distributed measurement noise sequence, and n_z, n_n are the dimensions of the measurement and measurement noise vectors, respectively. Equation 3 is often called the observation model.

If we take a probabilistic approach to solving this problem, we seek filtered estimates x̃_t based on the set of all available measurements z_{1:t} = {z_i, i = 1, ..., t} up to time t, together with some degree of belief in them. Namely, we want to know the p.d.f. p(x_t | z_{1:t}), assuming that the initial p.d.f. p(x_0 | z_0) = p(x_0) is known.

The optimal solution to this problem can be found if we assume certain constraints [15], but this is usually not the case.

Figure 1 Results of the simultaneous tracking of four people. The frame sequence is taken from a synthetic 1,024 × 768 video, the sparse templates are composed of approximately 230 feature points, and each one is tracked using 1,000 particles.


One solution, the particle filter method, is an approximation method that makes use of Monte Carlo simulations [2]. We now consider discrete particles forming the set:

(\tilde{X}_{t-1}, \Pi_{t-1}) = \{ (\tilde{x}^0_{t-1}, \pi^0_{t-1}), \ldots, (\tilde{x}^M_{t-1}, \pi^M_{t-1}) \}    (4)

Every particle contains information about one possible state of the system x̃^j_{t-1} and its importance weight π^j_{t-1}. This sample set is propagated in the Condensation algorithm as follows:

Select: A set of M new particles (X̃_t, Π_t) is generated by random selection from the previous particles, using a p.d.f. according to their weights Π_{t-1}.

Diffuse: Gaussian random noise is added to each parameter of each particle in X̃_t.

Measure weight: The probability Π_t is measured using template matching, based on the difference error between the template (the collection of all pixels forming the face at t = 0, or some carefully selected pixels of the template if we use a sparse template [16, 17], as we describe in Section 2.3) and the input image at each time step.

The matching error, ε^j_t, is calculated based on the difference in intensity values between selected pixels in the template (feature points) and the corresponding pixels in the image at time t. A feature point in the template is projected onto the image plane based on the state vector of the particle, x̃^j_t. Assuming weak-perspective projection, a point on the template q = [q_x, q_y, q_z]^T is projected to point p = [p_x, p_y]^T on the image plane as written in

\begin{bmatrix} p_x \\ p_y \end{bmatrix} = S \cdot R_{2\times 3} \begin{bmatrix} q_x \\ q_y \\ q_z \end{bmatrix} + \begin{bmatrix} T_x \\ T_y \end{bmatrix}    (5)

\begin{bmatrix} R_{2\times 3} \\ \omega_1\ \omega_2\ \omega_3 \end{bmatrix} = R_x R_y R_z    (6)

R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos R_x & -\sin R_x \\ 0 & \sin R_x & \cos R_x \end{bmatrix}    (7)

R_y = \begin{bmatrix} \cos R_y & 0 & \sin R_y \\ 0 & 1 & 0 \\ -\sin R_y & 0 & \cos R_y \end{bmatrix}    (8)

R_z = \begin{bmatrix} \cos R_z & -\sin R_z & 0 \\ \sin R_z & \cos R_z & 0 \\ 0 & 0 & 1 \end{bmatrix}    (9)

where R_{2×3} denotes the 2 × 3 upper submatrix of the rotation matrix R. The matching error between the template and the input image is defined based on the intensity of each feature point, J(q_m), m = 1, ..., N, and the intensity of the corresponding projected point, I(p_m), modified by the global illumination factor α. More precisely, the matching error, defined as the average matching error of the N feature points, can be written as

\varepsilon = \frac{1}{N} \sum_{m=1}^{N} \rho\left(\kappa\left(J(q_m), I(p_m)\right)\right)    (10)

\kappa(J, I) = \frac{\alpha \cdot I - J}{J}    (11)

\rho(\kappa) = \frac{\kappa^2}{\kappa^2 + 1}    (12)

where κ(J, I) denotes the relative residual between the intensity values of the feature point q and the projected point p, and ρ(κ) represents a robust function which is used to increase the robustness against outlier pixels [17]. From the matching error of each particle, ε^j, j = 1, ..., M, the weight of each particle is calculated as

\pi^j \propto 1/\varepsilon^j    (13)

where \sum_{j=1}^{M} \pi^j = 1. Finally, if we rearrange the particles in descending order of weight, we can estimate the current state as the average of the best M' ≤ M particles:

\tilde{x}_t = \frac{\sum_{i=1}^{M'} \pi^i_t \cdot \tilde{x}^i_t}{\sum_{i=1}^{M'} \pi^i_t}    (14)
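
A plain CPU reference of Eqs. (5)-(14) may help make the data flow concrete. It reuses the State and Particle structs sketched after Eq. (1); the Feature type, the 'sample' callable standing in for a gray-level lookup into the current frame, and all other names are illustrative, not the authors' code.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Feature { float qx, qy, qz, gray; };          // template point q_m and its intensity J(q_m)

// Matching error of one particle (Eq. 10).
template <class Sampler>
float matchingError(const State &x, const std::vector<Feature> &feats, Sampler sample)
{
    // R = Rx * Ry * Rz (Eqs. 6-9); weak perspective only needs its upper 2x3 block.
    const float cx = std::cos(x.rx), sx = std::sin(x.rx);
    const float cy = std::cos(x.ry), sy = std::sin(x.ry);
    const float cz = std::cos(x.rz), sz = std::sin(x.rz);
    const float r00 = cy*cz,            r01 = -cy*sz,            r02 = sy;
    const float r10 = sx*sy*cz + cx*sz, r11 = cx*cz - sx*sy*sz,  r12 = -sx*cy;

    float err = 0.0f;
    for (const Feature &f : feats) {
        const float px = x.s * (r00*f.qx + r01*f.qy + r02*f.qz) + x.tx;   // Eq. (5)
        const float py = x.s * (r10*f.qx + r11*f.qy + r12*f.qz) + x.ty;
        const float I  = sample(px, py);                                  // intensity I(p_m)
        const float k  = (x.alpha * I - f.gray) / f.gray;                 // Eq. (11)
        err += (k * k) / (k * k + 1.0f);                                  // Eq. (12)
    }
    return err / static_cast<float>(feats.size());                        // Eq. (10)
}

// Eq. (13): weights proportional to 1/error, normalized so they sum to one.
void weightsFromErrors(std::vector<Particle> &parts, const std::vector<float> &eps)
{
    float sum = 0.0f;
    for (std::size_t j = 0; j < parts.size(); ++j) {
        parts[j].weight = 1.0f / eps[j];
        sum += parts[j].weight;
    }
    for (Particle &p : parts) p.weight /= sum;
}

// Eq. (14): weighted mean of the M' highest-weighted particles.
State estimateState(std::vector<Particle> parts, std::size_t mPrime)
{
    std::sort(parts.begin(), parts.end(),
              [](const Particle &a, const Particle &b) { return a.weight > b.weight; });
    State m{};
    float wsum = 0.0f;
    for (std::size_t i = 0; i < std::min(mPrime, parts.size()); ++i) {
        const Particle &p = parts[i];
        m.tx += p.weight * p.x.tx;  m.ty += p.weight * p.x.ty;  m.s += p.weight * p.x.s;
        m.rx += p.weight * p.x.rx;  m.ry += p.weight * p.x.ry;  m.rz += p.weight * p.x.rz;
        m.alpha += p.weight * p.x.alpha;
        wsum += p.weight;
    }
    m.tx /= wsum;  m.ty /= wsum;  m.s /= wsum;
    m.rx /= wsum;  m.ry /= wsum;  m.rz /= wsum;  m.alpha /= wsum;
    return m;
}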

2.2 Algorithmic Scheme

Our tracker can be studied in three big blocks: initialization, tracking, and display (Fig. 2). In our system, the initialization stage is performed in a separate thread on the host system, while the main thread performs the tracking using the GPU and displays the results:

Figure 2 Algorithmic scheme of the system.

Initialization: The initialization thread scans the image looking for new faces that are not currently being tracked. This can be done either continuously or every n frames, as new faces are not expected to pop up at any time. When a new frontal face is detected (time t0 with regard to tracking), the description template is obtained from the image. N feature points are extracted by the selected criteria, and the resulting stream of data (consisting of gray-level values, the x, y, z coordinates of the points, and normals to the face surface; see Section 2.3) is created for each object. At time 0, M particles per object being tracked are randomly selected and their state-space values filled with random values (uniformly distributed around the well-known initial state at t0: the translation equals the current position of the template, the scale is 1, the rotation around every axis is 0, and the illumination is 1). The weight of every particle at this point is the same for all (and equals 1/M). See Section 2.3 for a more detailed description of this stage.
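
As a concrete illustration of this initialization (spread values and names are our assumptions, not taken from the paper), the particle set for a newly detected face could be created as follows, using the Particle/State structs sketched in Section 2.1:

#include <cstddef>
#include <random>
#include <vector>

std::vector<Particle> initParticles(float faceX, float faceY, std::size_t M,
                                    std::mt19937 &rng)
{
    std::uniform_real_distribution<float> u(-1.0f, 1.0f);
    const float dT = 5.0f, dS = 0.05f, dR = 0.05f, dA = 0.05f;   // assumed spreads
    std::vector<Particle> parts(M);
    for (Particle &p : parts) {
        p.x = { faceX + dT * u(rng), faceY + dT * u(rng),        // around the detected position
                1.0f + dS * u(rng),                              // scale around 1
                dR * u(rng), dR * u(rng), dR * u(rng),           // rotations around 0
                1.0f + dA * u(rng) };                            // illumination around 1
        p.weight = 1.0f / static_cast<float>(M);                 // uniform initial weight
    }
    return parts;
}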

Tracking: In this stage the actual particle filtering is performed. As described before, it consists of the steps of selection (or drifting), diffusion, and weight measurement. See Section 2.4 for a more detailed description of the weighting step and of how the workload is divided between the host processor and the stream processor. Once the weights of the particles are obtained, we pass the result to the display stage and perform the selection of new particles and their random diffusion (by means of Gaussian noise with mean and standard deviation chosen carefully to provide both stable and fast tracking) in order to diversify the set and avoid the degeneracy problem [15]. After that, this stage starts again and is repeated indefinitely, unless the quality of the tracking results degrades so much that we must consider that tracking has failed (this may be caused by the face leaving the video sequence, excessive occlusion, or some other limits being surpassed).
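
A host-side sketch of the Select and Diffuse steps just described; the noise standard deviations are placeholders (the paper only states that they are chosen carefully), and the Particle/State structs are the ones sketched in Section 2.1.

#include <cstddef>
#include <random>
#include <vector>

// Select: resample M particles with probability proportional to their weights,
// then Diffuse: add zero-mean Gaussian noise to every state parameter.
void selectAndDiffuse(std::vector<Particle> &parts, std::mt19937 &rng)
{
    std::vector<float> w(parts.size());
    for (std::size_t j = 0; j < parts.size(); ++j) w[j] = parts[j].weight;
    std::discrete_distribution<std::size_t> pick(w.begin(), w.end());

    std::vector<Particle> next(parts.size());
    for (Particle &p : next) p = parts[pick(rng)];

    std::normal_distribution<float> nT(0.0f, 2.0f), nS(0.0f, 0.01f),
                                    nR(0.0f, 0.02f), nA(0.0f, 0.01f);   // placeholder sigmas
    for (Particle &p : next) {
        p.x.tx += nT(rng);  p.x.ty += nT(rng);
        p.x.s  += nS(rng);
        p.x.rx += nR(rng);  p.x.ry += nR(rng);  p.x.rz += nR(rng);
        p.x.alpha += nA(rng);
    }
    parts.swap(next);
}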

Display: After every tracking step, we average the best particles to get our approximation as the system output. The resulting state-space values can be written to an output text file, fed to another system for consumption, or displayed on the screen as rotated, scaled, and translated 3D-mesh models (Fig. 1).

2.3 Initialization Stage

In order to detect new frontal faces in the image, a Viola and Jones [18, 19] boosting algorithm is employed, as these detectors are quite fast and have been proven to work very well in practice. After checking that the detected face is not currently being tracked (by simply comparing it to known face positions), the subimage formed by the detected rectangle is passed to the next step for template extraction.

In the sparse template matching method [16, 17], a sparse template is carefully made up of a small set of pixels (feature points) from a full template, with the idea of making the tracking more efficient by reducing the number of calculations. In addition to this, we resort to the sparse template matching method to find relevant points to track, and we treat those points as uni-dimensional streams of data to raise processing efficiency.

In this method, the human face is typically approximated as a planar surface forming the sparse template (that is, the z coordinate of each feature point is simply a constant). In order to increase the precision, more detailed models of the human face can be used: parameterized geometric models such as sections of a cylinder or an ellipsoid, 3D-mesh models, and so on. Our system adopts a generic 3D model of the human face, which we personalize to each detected face by means of the Active Appearance Model.

The Active Appearance Model [20] (AAM) is a computer vision algorithm for matching a statistical deformable model of an object's shape and appearance to a new image. Such models are built during a training phase using a set of images together with the coordinates of some landmarks (Fig. 3a). We perform the fitting of one of these models over the face subimage, thus obtaining the 2D coordinates of a series of landmark points (53 in our case) that correspond to previously known features of the human face (corners of the mouth, eyes, head oval, etc.). We have modified these landmarks to also include a pair of texture coordinates, a set of numbers that refer to a unique part of a texture resource stored in memory. In this way, we can create two image buffers the same size as the face image by applying (mapping) our 3D face model to them; this pair is described as a "heightmap" and a "normalmap".

A heightmap, or heightfield, is a raster image (stored as a gray-level image) in which each pixel represents the surface elevation data of an object. On the other hand, a normalmap, or bumpmap, is a raster image (stored as an RGB color image) in which each pixel contains a normal XYZ vector (each color component containing a coordinate). These normals represent the vector perpendicular to the surface of the 3D face model.



Figure 3 a Some of the images used to train the AAM, together with their landmarks; b the heightmap used for our generic face model; c the normalmap.

See Fig. 3 for a graphical explanation of these textures, and Fig. 4 for the model creation process.
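
Assuming the common byte encoding that maps each color channel from [0, 255] to a coordinate in [-1, 1] (the paper does not state the exact convention), decoding one normalmap pixel could look like this:

// Hypothetical decoding of a surface normal stored as an RGB pixel.
struct Normal { float x, y, z; };

Normal decodeNormal(unsigned char r, unsigned char g, unsigned char b)
{
    return Normal { r / 127.5f - 1.0f,    // red   -> x
                    g / 127.5f - 1.0f,    // green -> y
                    b / 127.5f - 1.0f };  // blue  -> z
}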

Next, the feature points are selected from local minimum/maximum points and boundary dipoles in the image, as in [16]. The local minimum/maximum points are defined as local extreme points over the gray-level distribution of the image. First, candidate extreme points are detected by checking the 8-neighbors of each pixel in the image. The final local minimum/maximum points are then selected in ascending/descending order of the gray-level values of the candidate pixels; each point is selected so that it keeps a certain distance from other points that have already been selected. On the other hand, the boundary dipoles are defined as line segments that straddle and intersect at right angles the image boundaries in the input image. The boundary-dipole-based feature points are defined as both end points of the boundary dipoles. Here, the image boundaries are extracted as zero-cross boundaries in a zero-cross image, which is created by applying a Laplacian-Gaussian filter to the input image.

The boundary dipoles are selected in descending order of the absolute difference in gray-level values of the end points of the dipoles, where the same separation criterion is applied as in the case of the minimum/maximum points.
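
A sketch of the local-maximum selection just described (minima and boundary dipoles are handled analogously); the candidate test and the separation rule follow the text above, while the names and thresholds are ours:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Pt { int x, y; std::uint8_t gray; };

// Candidates from an 8-neighbour extremum test, then greedy picking in
// descending gray-level order with a minimum separation between points.
std::vector<Pt> selectMaxima(const std::uint8_t *img, int w, int h,
                             std::size_t wanted, int minDist)
{
    std::vector<Pt> cand;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            std::uint8_t c = img[y * w + x];
            bool isMax = true;
            for (int dy = -1; dy <= 1 && isMax; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if ((dx || dy) && img[(y + dy) * w + (x + dx)] >= c) { isMax = false; break; }
            if (isMax) cand.push_back({x, y, c});
        }
    std::sort(cand.begin(), cand.end(),
              [](const Pt &a, const Pt &b) { return a.gray > b.gray; });
    std::vector<Pt> out;
    for (const Pt &p : cand) {
        bool farEnough = true;
        for (const Pt &q : out)
            if ((p.x - q.x) * (p.x - q.x) + (p.y - q.y) * (p.y - q.y) < minDist * minDist) {
                farEnough = false;
                break;
            }
        if (farEnough) { out.push_back(p); if (out.size() == wanted) break; }
    }
    return out;
}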

Finally, with the information taken from the three images (the rectangular face subimage, and the depth and normal buffers), our template can be formed by a stream of feature points, each formed as follows:

p = (P_x, P_y, P_z, N_x, N_y, N_z, J)    (15)

where P_x, P_y, P_z are the coordinates of the feature point, N_x, N_y, N_z form the vector normal to the face surface at the feature point (see Section 2.4 for the use of this information), and J is the gray level of the feature point.
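
One possible way to pack this per-object stream of feature points into a flat buffer before the single upload to device memory described in Section 3; the seven-floats-per-point layout and the names are our assumption:

#include <vector>

struct FeaturePoint { float px, py, pz, nx, ny, nz, gray; };   // fields of Eq. (15)

std::vector<float> packTemplate(const std::vector<FeaturePoint> &pts)
{
    std::vector<float> buf;
    buf.reserve(pts.size() * 7);
    for (const FeaturePoint &p : pts)
        buf.insert(buf.end(), { p.px, p.py, p.pz, p.nx, p.ny, p.nz, p.gray });
    return buf;   // e.g. copied once with cudaMemcpy(..., cudaMemcpyHostToDevice)
}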

2.4 Weighting Stage

Of all the steps that form a PF algorithm (particle weighting, selection, and diffusion), particle weight computation is the most expensive one.


Figure 4 a Frontal face image at time t = 0; b the Viola and Jones detector finds a rectangle containing a face, and the AAM is 2D-fitted to the shape of that face; c the depth map texture is warped to the AAM shape; d the same for the normal map; e the feature points (position and gray level) that form the sparse template are selected using image processing techniques, and their depth and normal-to-the-surface values are extracted from the prior maps.


What this stage does is score each particle by means of a likelihood weight between each one of the transformed feature points and the equivalent point in the current frame. The particle filter algorithm itself is computationally expensive, but the weight computation is the main bottleneck, so we decided to subject it to stream processing. Our two input streams are the stream formed by all the particles and the stream formed by all the feature points.

Weight calculation for each particle is an independent process, as is the matching error calculation for each feature point. Our method exploits these independencies to enable parallel processing: the kernels must perform the 3D transformation of each feature point as estimated by each of the particles, and then a comparison is made of the feature point's gray level against the resulting point in the full image. The sum of all those comparisons over the feature points results in the weight of one particle. This is our output stream: the collection of weight values of every particle.

Using GPGPU techniques, we would perform this process using two passes, each one with a different kernel. The first kernel would take as input the two streams of data (M particles and N feature points) and output a 2D stream (M rows and N columns) containing in each element the error contributions of each particle and each feature point (geometrical transformation as in Eq. 5 and individual contributions as in Eq. 10). The second kernel computes a smaller stream from the larger 2D input stream by summing all the elements in the same row (as in Eq. 10). This type of computation is called a reduction [21]. This is necessary because, in traditional GPGPU, each processing unit is limited to its own memory and has access to only a few fast registers, unlike the much larger memory available to a standard streaming processor. CUDA, however, provides an architecture in which groups of SIMD processors (called multiprocessors) share access to a fast, common memory. Each multiprocessor can then be considered a standard streaming unit, and we can use that ability to perform the full weighting operation in just one kernel (each multiprocessor takes one element from the stream of particles and the full N elements from the stream of feature points, and computes one weight value towards the final output stream).

We still haven't explained the use of the vectors normal to the surface at each feature point: we can employ this extra information to realize a simple feature-point occlusion detector. Transforming the normals (in the same way as the feature positions, except that we consider only the rotation effect) yields a coarse measure of the 3D face model pose, which is used to discard those points that are likely to be occluded by the face itself (that is, normal vectors that after rotation show a negative value of the z coordinate do not point towards the camera and are likely to be occluded in the final image, so they do not contribute to the error measurement or the final weight of the particle).
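
A sketch of this self-occlusion test, reusing the rotation composition from the matching-error sketch in Section 2.1; the sign convention follows the text above:

#include <cmath>

// True if the rotated normal still points towards the camera (z >= 0).
bool facingCamera(const State &x, float nx, float ny, float nz)
{
    const float cx = std::cos(x.rx), sx = std::sin(x.rx);
    const float cy = std::cos(x.ry), sy = std::sin(x.ry);
    const float cz = std::cos(x.rz), sz = std::sin(x.rz);
    // Only the third row of R = Rx * Ry * Rz is needed for the rotated z coordinate.
    const float r20 = sx*sz - cx*sy*cz;
    const float r21 = sx*cz + cx*sy*sz;
    const float r22 = cx*cy;
    return (r20 * nx + r21 * ny + r22 * nz) >= 0.0f;
}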

3 CUDA Implementation Details

NVIDIA CUDA (Compute Unified Device Architecture [5]) is a hardware and software architecture that allows the GPU to be viewed as a data-parallel computing device that operates as a coprocessor to the main CPU (the host). The device maintains its own DRAM, which can be filled with data from the host DRAM via DMA transfers. The computing approach is that of the stream processing paradigm: the user defines input streams and a kernel, or program composed of operations to be executed over that data in a parallel fashion. In CUDA, the same kernel is executed by many different "threads", organized as follows:

A thread block is a batch of threads that can cooperate by efficiently sharing data through some fast shared memory and by synchronizing their execution to coordinate memory accesses. Blocks of the same dimensionality and size that execute the same kernel can also be batched together (into a grid of blocks), augmenting the total number of threads that can be launched by a single kernel invocation; note that there cannot be any communication between threads of different blocks.

The device is implemented as a set of multiprocessors, and each multiprocessor is composed of many processing units organized in a Single Instruction, Multiple Data (SIMD) architecture. A grid is executed on the device by executing one or more blocks on each multiprocessor using time slicing: each block is split into SIMD groups of threads called warps, with a thread scheduler periodically switching from one warp to another to maximize the use of the multiprocessor's computational resources. The blocks will run sequentially if the underlying hardware has very few parallel capabilities, in parallel if it has many, or in a mixture of both.

Our algorithm (Fig. 5) uses CUDA as follows:

– At time 0, the host dumps the contents of a buffer containing the feature-point stream to device memory. This input stream doesn't change during the tracking process, so it is kept in device memory and used in every iteration of the algorithm.


Figure 5 Streams and kernels in the system.

– At every step, the host sends the current video frame and a buffer containing the particle stream to device memory.

– The kernel is invoked and executed in M blocks, each with N threads (that is, one block per particle and one thread per feature point per block). Each thread computes the matching error of a particle/feature pair (it transforms the feature point's position and normal vector, accesses the video frame pixel at that position if there is no occlusion, and compares it with the feature point's gray level). The result is placed by every thread in a different position of the block's shared memory.

– Next, an instruction in the kernel synchronizes all the threads in a block (makes them wait until all the threads in the block arrive at that execution point in the kernel), and then the first thread of every block loops through the shared memory to sum all the matching errors. The result is placed in global device memory (a compressed sketch of such a kernel is given after this list).

– The host recovers from device memory the output stream containing the weight of every particle. The algorithm then continues its normal flow of execution until a new video frame and stream of particles require processing.
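
Putting the previous bullets together, a compressed sketch of such a kernel could look as follows. The flat data layout, the plain global-memory frame access with nearest-pixel lookup, and the serial reduction by thread 0 follow the description above, but this is our illustration of the scheme, not the authors' exact implementation.

__global__ void weightKernel(const float *feats,      // N x 7 floats: Px, Py, Pz, Nx, Ny, Nz, J
                             const float *particles,  // M x 7 floats: Tx, Ty, S, Rx, Ry, Rz, alpha
                             const unsigned char *frame, int w, int h,
                             float *errOut)           // M matching errors (host derives the weights)
{
    extern __shared__ float err[];                    // one slot per feature point
    const int j = blockIdx.x;                         // particle index (one block per particle)
    const int m = threadIdx.x;                        // feature index  (one thread per feature)
    const float *x = &particles[7 * j];
    const float *f = &feats[7 * m];

    // Rotation R = Rx * Ry * Rz built from the particle's angles (Eqs. 6-9).
    float cx = cosf(x[3]), sx = sinf(x[3]);
    float cy = cosf(x[4]), sy = sinf(x[4]);
    float cz = cosf(x[5]), sz = sinf(x[5]);
    float r00 = cy*cz,            r01 = -cy*sz,            r02 = sy;
    float r10 = sx*sy*cz + cx*sz, r11 = cx*cz - sx*sy*sz,  r12 = -sx*cy;
    float r20 = sx*sz - cx*sy*cz, r21 = sx*cz + cx*sy*sz,  r22 = cx*cy;

    // Weak-perspective projection of the feature point (Eq. 5) and rotated normal z.
    float px = x[2] * (r00*f[0] + r01*f[1] + r02*f[2]) + x[0];
    float py = x[2] * (r10*f[0] + r11*f[1] + r12*f[2]) + x[1];
    float nz = r20*f[3] + r21*f[4] + r22*f[5];

    // Occluded or out-of-frame points contribute zero (a simplification of the text).
    float e = 0.0f;
    if (nz >= 0.0f && px >= 0.0f && px < w && py >= 0.0f && py < h) {
        float I = frame[(int)py * w + (int)px];
        float k = (x[6] * I - f[6]) / f[6];           // Eq. (11)
        e = (k * k) / (k * k + 1.0f);                 // Eq. (12)
    }
    err[m] = e;
    __syncthreads();                                  // wait for every feature thread in the block

    if (m == 0) {                                     // serial reduction by the block's first thread
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += err[i];
        errOut[j] = sum / blockDim.x;                 // average error (Eq. 10)
    }
}

// Hypothetical launch: M blocks of N threads and N floats of shared memory per block.
// weightKernel<<<M, N, N * sizeof(float)>>>(d_feats, d_particles, d_frame, w, h, d_err);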

4 Experiments, Results and Future Improvement

A first proof of concept was developed using the Brook language [22] from Stanford University. Brook for GPUs is a compiler and runtime implementation of the Brook stream programming language for graphics chips; it quite successfully hides the burden of handling GPGPU techniques from the programmer. The algorithm was validated, but some factors prompted us to develop a more tuned version using NVIDIA CUDA [5]. While Brook works with a wide range of graphics cards, CUDA is limited to just a few new architectures by NVIDIA (at the time of writing this article, only GeForce 8-series GPUs) that better resemble a pure stream processor (shared memory model, scatter operations, and so on), allowing a more efficient implementation of the algorithm.

The main limiting factor we found with Brook was the impossibility of implementing the operations described in Section 3 with only one kernel, requiring instead the use of multipass techniques (several chained kernels, each one running on the output stream of the previous one). The reason is that Brook (targeting older GPU architectures than CUDA) lacks the shared memory between processing units (and therefore there is no possible communication between them) and also the ability to synchronize the different threads: see Section 2.4 for a description of the steps involved in this GPGPU (as opposed to SP) technique. Brook also suffers from the excessive overhead associated with using a high-level language that "metacompiles" [22] to C++ and a run-time library for OpenGL/DirectX plus Cg/HLSL operations; CUDA, on the other hand, is a native architecture.

[Figure 6 shows a plot, "Particle weighting speed": time per frame (ms) versus number of particles, for the GPU and Software versions.]

Figure 6 Speed of the particle weighting stage, comparing stream processing in the GPU version to a serial CPU-only version. Video 1,024 × 768, 217 feature points.


[Figure 7 shows a plot, "Full application speed": frames per second versus number of faces, for the GPU and Software versions.]

Figure 7 Speed of the full application, comparing stream processing in the GPU version to a serial CPU-only version. Video 1,024 × 768, ≈230 feature points per face, and 1,000 particles per face.

The developed software (a mixture of C++ and CUDA) has been tested on an Intel Core 2 2.66 GHz host system with 2 GB RAM, using an NVIDIA GeForce 8800GTX GPU as the stream processor. This GPU features 128 processing units organized in 16 multiprocessors, for an approximate peak performance of 350 GFLOPS under CUDA. For comparison purposes, a software-only version of the weighting algorithm was also developed (single-threaded, no SIMD extensions used) for execution on the host CPU alone.

For the adaptive models, we used AAM-API, a C++ implementation of the Active Appearance Model framework developed by Mikkel B. Stegmann [23]. The generic face model heightmap and normalmap were created by hand on the basis of a subdivided CANDIDE [24] model, a parameterized face mask specifically developed for model-based coding of human faces. The Viola and Jones implementation provided with OpenCV (Open Computer Vision Library) was used as the frontal face detector.

The results indicate an important speed boost compared to the CPU-only version of the algorithm, especially when using a large number of particles (Fig. 6) and/or tracking multiple objects simultaneously (Fig. 7), making the tracker eminently suitable for real-time processing on a standard PC platform.

Unlike other approaches to fast hardware solutions for particle filtering, like dedicated architectures or FPGA solutions, stream processing targets commercial off-the-shelf (COTS) processors, be it a GPU, a Cell BE, or a Symmetric Multi-Processor (SMP). However, many things can be learned from these approaches (usually taken with the inherent parallelism of PFs in mind) that could lead to big improvements in our SP implementation.

One of those papers is [25], in which considerable effort was made to improve resampling algorithms for PFs so that they could be efficiently implemented in hardware in a distributed fashion. In our specific PF problem, we feel the weighting/importance step represents a bigger bottleneck than the selection/resampling step, so we have decided to perform the latter on the CPU (we are thus performing Centralized Resampling in the terminology of [25]). Other types of distributed resampling could allow the complete tracking stage to be performed on the GPU, without having to move all the weight floating-point values from device to host memory, thus overcoming the graphics bus bandwidth bottleneck.

5 Conclusions

We have described a system for 3D visual tracking capable of achieving real-time performance thanks to the use of a GPU for parallel computation. The use of the stream processing approach greatly simplified the development issues and at the same time opened the door to other computing architectures. The goals imposed before starting the design (automatic, robust, just one camera, conventional computing resources, multi-object, real-time) have all been achieved, and the system is currently being used for further research in the area of conversation scene analysis [1], as expected. The novelty of the proposed work lies not only in the usage of a stream processor for 3D visual tracking, but also in an improved sparse template initialization method that improves the accuracy and stability of tracking by means of a simple, generic 3D model of the human face.

Anyone observing the trends in processors' transistor counts and computing power will notice that, in order to continue achieving performance improvements, processor vendors have shifted their strategy from increasing clock speed to increasing the number of cores per processor. Software therefore has to keep up with this concurrency drive. Multimedia applications in particular are hungry for new computing power, and their special characteristics mean that future architectures will differ from the ones we are used to. Stream processors are a strong candidate to fulfill that demand (modern GPUs offer incredible amounts of raw processing power). Fortunately, the paradigm driving the new hardware can meet this challenge (exposing desirable properties of multimedia algorithms, as well as creating natural ways of implementing them, as in our particle filtering case).

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

1. Otsuka, K., Yamato, J., Takemae, Y., & Murase, H. (2006). Conversation scene analysis with dynamic Bayesian network based on visual head tracking. In Proc. IEEE intl. conf. on multimedia and expo 2006 (pp. 949–952).

2. Doucet, A., de Freitas, N., & Gordon, N. (Eds.) (2001). Sequential Monte Carlo methods in practice. Springer.

3. Isard, M., & Blake, A. (1998). Condensation: Conditional density propagation for visual tracking. In Proc. of the 6th intl. conf. on computer vision (pp. 107–112).

4. Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., et al. (2007). A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1), 80–113.

5. NVIDIA (2007). CUDA (Compute Unified Device Architecture) programming guide ver. 1.0. http://developer.nvidia.com/object/cuda.html.

6. Peercy, M., Segal, M., & Gertsmann, D. (2006). A performance-oriented data parallel virtual machine for GPUs. In Proc. SIGGRAPH 2006.

7. McCool, M. (2007). Multi-core CPUs, accelerators, and the many-core future: A unified software approach with RapidMind. In Proc. SIGGRAPH 2007.

8. Kapasi, U. J., Rixner, S., Dally, W. J., Khailany, B., Ahn, J. H., Mattson, P., et al. (2003). Programmable stream processors. IEEE Computer, 36, 54–62, August.

9. Venkatasubramanian, S. (2003). The graphics card as a stream computer. In SIGMOD-DIMACS workshop on management and processing of data streams.

10. Kapasi, U., Dally, W. J., Rixner, S., Owens, J. D., & Khailany, B. (2002). The Imagine stream processor. In Proc. of intl. conf. on computer design (pp. 282–288).

11. Khailany, B., Williams, T., Lin, J., Long, E., Rygh, M., Tovey, D., et al. (2007). A programmable 512 GOPS stream processor for signal, image, and video processing. In IEEE international solid-state circuits conference 2007 digest of technical papers (pp. 272–602).

12. Fung, J., & Mann, S. (2004). Computer vision signal processing on graphics processing units. In Proc. IEEE international conference on acoustics, speech, and signal processing (Vol. 5, pp. 93–96).

13. Montemayor, A. S., Pantrigo, J. J., Sánchez, Á., & Fernández, F. (2004). Particle filter on GPUs for real-time tracking. In Proc. of ACM SIGGRAPH (p. 94).

14. Montemayor, A. S., Pantrigo, J. J., Cabido, R., & Payne, B. (2006). Bandwidth improved GPU particle filter for visual tracking. In Ibero-American symposium on computer graphics (SIACG).

15. Arulampalam, S., Maskell, S., Gordon, N. J., & Clapp, T. (2002). A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), 174–188, February.

16. Matsubara, Y., & Shakunaga, T. (2005). Sparse template matching and its application to real-time object tracking. IPSJ Transactions on Computer Vision and Image Media, 46(9), 17–40.

17. Matsubara, Y., & Shakunaga, T. (2004). Real-time object tracking by sparse template matching. In IPSJ SIG technical report, no. 26 (pp. 49–56).

18. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proc. of the IEEE computer society conference on computer vision and pattern recognition (Vol. 1, pp. 511–518).

19. Viola, P., & Jones, M. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.

20. Edwards, G. J., Taylor, C. J., & Cootes, T. F. (1998). Interpreting face images using active appearance models. In Proc. intl. conf. on face and gesture recognition (pp. 300–305).

21. Horn, D. (2005). Stream reduction operations for GPGPU applications. In GPU Gems 2. Addison-Wesley.

22. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., et al. (2004). Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3), 777–786.

23. Stegmann, M. B., Ersbøll, B. K., & Larsen, R. (2003). FAME: A flexible appearance modelling environment. IEEE Transactions on Medical Imaging, 22(10), 1319–1331.

24. Ahlberg, J. (2001). Candide-3: An updated parameterized face. Technical report, Dept. of Electrical Engineering, Linköping University.

25. Bolić, M., Djurić, P. M., & Hong, S. (2005). Resampling algorithms and architectures for distributed particle filters. IEEE Transactions on Signal Processing, 53(7), 2442–2450, July.

Oscar Mateo Lozano received the B.Sc. and M.Sc. degrees in Electrical Engineering from the Universidad Politécnica de Valencia in Spain, in 2004. Currently, he is a Ph.D. candidate at the Image Processing Group of the Universidad Politécnica de Madrid in Spain. He has researched medical image processing in Brazil, sensor networks in the USA, computer graphics in Spain, and video processing in Japan, and also has strong industry "hands-on" experience with networking and embedded architectures. His current main research interests are in the areas of computer graphics and high-performance computing architectures.


Kazuhiro Otsuka received his B.E. and M.E. degrees in electrical and computer engineering from Yokohama National University in 1993 and 1995, respectively. He joined the NTT Human Interface Laboratories in 1995. He received his Ph.D. in information science from Nagoya University in 2007. He is now a senior research scientist in the NTT Communication Science Laboratories. His current research interests include computer vision and communication scene analysis. He received the Best Paper Award of the IPSJ National Convention in 1998 and the Best Paper Award of the IAPR International Conference on Image Analysis and Processing in 1999. He is a member of the IEEE, the IPSJ, and the IEICE.