
Journal of Visual Communication and Image Representation 11, 41–57 (2000)

doi:10.1006/jvci.1999.0418, available online at http://www.idealibrary.com

3D Shape Inferencing and Modeling for Video Retrieval

Zhibin Lei

Bell Laboratories, 600 Mountain Avenue, Murray Hill, New Jersey 07974. E-mail: [email protected]

and

Yun-Ting Lin

Princeton University, Princeton, New Jersey 08540

Received January 20, 1998; accepted February 4, 1999

We present a geometry-based indexing approach for the retrieval of video databases. It consists of two modules: 3D object shape inferencing from video data and geometric modeling from the reconstructed shape structure. A motion-based segmentation algorithm employing feature block tracking and principal component split is used for multi-moving-object motion classification and segmentation. After segmentation, feature blocks from each individual object are used to reconstruct its motion and structure through a factorization method. The estimated shape structure and motion parameters are used to generate the implicit polynomial model for the object. The video data is retrieved using the geometric structure of objects and their spatial relationship. We generalize the 2D string to 3D to compactly encode the spatial relationship of objects. © 2000 Academic Press

1. INTRODUCTION

With the growing list of new applications in pictorial information handling and the progression of multimedia technology, video data is becoming a fundamental resource for modern databases, and the problem of efficient retrieval and manipulation of vast amounts of video information has become an important issue. Video database queries require searching video databases for a reference scene or object, taken from different camera positions and under different conditions. One approach is to create and store some representative frames for the video sequence. These representative frames are then matched against those stored for the reference sequences. The disadvantage of this approach is that the rich 3D object shape and temporal (motion) information are lost. Another approach is to estimate significant 3D scene and object structures for the video sequence and to compare them to those of the reference video sequences.


In this paper, we focus on the second approach and present the idea of a geometry-based indexing scheme which can capture both the object shape structures and their spatial relationship. In order to obtain an individual object's shape from a video stream, two processing steps are required: object segmentation and shape reconstruction. Object segmentation is a recently emerging topic which is critical for many content-related video applications. Its goal is to segment image features or individual pixels into object clusters. Most existing approaches assume one or a set of motion models (e.g., affine transform) to gradually extract objects with different motions, using either a featureless [1, 2, 3] or a feature-based approach [4, 5]. Featureless approaches are often computation intensive and do not offer direct estimates of the 3D shape of each segment at the end. For both approaches, the assumption of a particular motion model often restricts the object structure and therefore limits their generality for object shape modeling. For instance, an affine transform restricts the object to have a planar structure. Recently, a principal-component-based clustering algorithm was proposed [6] to find multiple object motion clusters under a rich collection of motion models. The resulting clusters are suitable for reconstructing 3D shapes of individual moving objects in the scene. After clustering the features into objects, the individual object shape can be reconstructed, given the correspondences of a sufficient set of features. This is the so-called structure-from-motion problem [7–10] in computer vision. Algebraic solutions based on constraints from photogrammetry are usually available if the feature correspondence problem and the multibody separation (i.e., object segmentation) problem can be solved. Therefore, to extract an individual object's shape for content-based video retrieval, it is natural to apply motion tracking to find such feature correspondences and then group the features into object clusters for shape reconstruction. A robust motion-tracking algorithm is required to provide such feature correspondence.

We apply a factorization method [11] to estimate the single-body 3D motion and shape for each object, assuming the para-perspective projection model. The 3D shape of each object is then represented by an implicit polynomial. A set of geometric features (invariants of implicit polynomials) is computed and used later as the feature index vector for queries. We treat individual geometric shape estimates and their spatial relationship at the same time to reduce ambiguity and processing time.

The geometric tools that we use for representing object shapes are implicit polynomial functions. Implicit polynomial (IP) representations and their algebraic invariants have played an important role in many graphics, image processing, and computer vision applications [12–18]. Recent works [19, 20] opened up new ways to do more robust and efficient high-degree implicit polynomial fits to given data sets. Implicit polynomial representations have many desired properties which make them ideal tools for modeling geometric shape structures. For instance, implicit polynomials consume very little storage space (only the coefficient vector of the polynomial function) and they have the power to interpolate missing data. Their algebraic invariants [17, 15], functions of the polynomial coefficients which retain the same values after a transformation of the underlying coordinate system, can serve as the primal invariant geometric shape structure for the object. Another distinctive property of the IP representation is the availability of a Bayesian recognition engine [17]. It treats the recognition problem as the maximization of the Bayesian a posteriori probability. Shape-invariant feature vectors are used to compute a distance value which measures how closely two shapes match. Because of the underlying ambiguity in image and video data, caused for example by image formation and quantization errors, camera positions, and lighting conditions, the recognition problem should in general be treated as the recognition of a class of similar shapes, instead of a single shape. The Bayesian framework provides an ideal tool for this purpose.

The rest of the paper is organized as follows. In Sections 2.1 and 2.2, we discuss the problem of motion-based scene segmentation and 3D motion and shape reconstruction, which is the first step in processing the video data. We then discuss in Section 2.3 the implicit polynomial models that we use as the general geometric shape representations, their invariant features, and the Bayesian recognizer based on them.

In Section 2.4 we discuss an approach for obtaining implicit polynomial representations from video sequences. It is based on fitting the model to the reconstructed 3D data set and uses the motion parameters estimated in Section 2.2. Section 3 briefly discusses the idea of geometry-based indexing that uses the invariant geometric features of implicit polynomial representations. Section 4 draws conclusions and discusses future work.

2. OBJECT SEGMENTATION AND REPRESENTATION

2.1. Multiple Motion Classification and Object Segmentation

Normally, a video scene contains multiple objects of interest. In order to model the shape of each object, a segmentation algorithm is needed that separates individual objects from the scene. An object of interest in the video scene usually has its own motion, or it is a close-range object whose appearance changes under camera motion. In both cases, the object can be identified through its relative motion with respect to the background. This section outlines our approach to motion-based object segmentation. First, the 2D motion vectors of a nondense motion field are estimated using a feature block tracking algorithm (Section 2.1.1). These motion vectors from multiple objects are then classified into different motion/object clusters using a principal component clustering algorithm (Section 2.1.2).

2.1.1. Feature Block Tracking

Motion tracking has been a very active topic in image/video processing and computer vision for decades. Unlike motion-compensated video coding, motion classification and segmentation rely on accurate estimation of the motion field. However, since motion tracking is an ill-posed problem, an ideal solution is usually not attainable. The solution is often affected by the constraints on motion smoothness or discontinuity, as well as by the image structure. Among the existing methods, optical flow algorithms [21] produce an over-smoothed dense motion field and do not work well for fast object motion, while simple block matching algorithms often fail to find correct motion vectors. In our application, extracting the 3D object shape structure for building the implicit polynomial model requires a sufficient number of good motion vectors. We estimate motion vectors with a feature block tracking algorithm [22] to provide such feature correspondence. This approach uses neighborhood relaxation and multicandidate prescreening to produce a nondense motion field. The best motion vector for the k-th feature block B_k is chosen from a pool of possible motion candidates by minimizing a cost function defined as

C(B_k, v) = R(B_k, v) + \sum_{B_j \in N(B_k)} W(B_j, B_k) \times \min_{\gamma} R(B_j, v + \gamma),   (1)


where v denotes a candidate motion vector for B_k. The best motion vector v^*(B_k) is

v^*(B_k) = \arg\min_{v} C(B_k, v).   (2)

Here R(B_k, v) is the residue (or displaced block difference), which is usually the mean square error between block B_k in frame t and the block displaced by v in frame t + 1. The neighborhood system N(B_k) contains four neighbors of the reference block B_k. Each neighbor B_j of B_k (j ≠ k) has its own weighting factor, denoted by W(B_j, B_k), which depends on the distance between blocks B_j and B_k, their similarity (e.g., intensity, color, texture), and the feature strength (e.g., gradient) of the reference block B_k. The introduction of γ (neighborhood relaxation) allows possible variation of a candidate motion vector v over the neighborhood N(B_k). In other words, it allows the neighboring blocks to have a similar but not necessarily identical motion vector to the reference block. This makes the model more flexible in finding the best motion vector when the object has a nontranslational type of motion. The neighborhood relaxation provides local motion smoothness, while the weighting system allows motion discontinuity within the neighborhood by giving a different weight to each neighbor of the reference block. In our implementation, the feature block size is chosen to be 8 × 8 and the search range is set to the maximum possible displacement for a block, usually [−16, 16] × [−16, 16]. Initially the frame is partitioned into default 8 × 8 blocks. Thresholding on the variance of the block and on the cost of the winning motion candidate is applied to prune unreliable motion estimates. As a condition, in order to obtain a sufficient number of features to work with, the target objects must have distinctive image features on their surfaces. This is a common restriction for most feature tracking algorithms. For homogeneous objects lacking texture details, an approach that detects and tracks object edges may be required. In general, this tracking algorithm provides good motion estimates on interior surface feature blocks.
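A minimal sketch of the candidate search in Eqs. (1)–(2), assuming grayscale frames stored as numpy arrays; the function names, the relaxation range, and the caller-supplied neighbor weights are illustrative choices, not the implementation of [22]:

```python
import numpy as np

def residue(frame_t, frame_t1, top_left, v, bs=8):
    """Displaced block difference R(B, v): mean squared error between
    block B in frame t and the block displaced by v in frame t+1."""
    (y, x), (dy, dx) = top_left, v
    H, W = frame_t1.shape
    if not (0 <= y + dy <= H - bs and 0 <= x + dx <= W - bs):
        return np.inf  # candidate falls outside the frame
    a = frame_t[y:y + bs, x:x + bs].astype(float)
    b = frame_t1[y + dy:y + dy + bs, x + dx:x + dx + bs].astype(float)
    return float(np.mean((a - b) ** 2))

def best_motion_vector(frame_t, frame_t1, block, neighbors, weights,
                       search=16, relax=1):
    """Minimize the cost of Eq. (1): the block's own residue plus the
    weighted, relaxed residues of its neighbors (weights stand in for
    W(B_j, B_k); relax bounds the gamma search)."""
    gammas = [(gy, gx) for gy in range(-relax, relax + 1)
                       for gx in range(-relax, relax + 1)]
    best_v, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cost = residue(frame_t, frame_t1, block, (dy, dx))
            for nb, w in zip(neighbors, weights):
                # Neighborhood relaxation: a neighbor may move by v + gamma.
                cost += w * min(residue(frame_t, frame_t1, nb,
                                        (dy + gy, dx + gx))
                                for gy, gx in gammas)
            if cost < best_cost:
                best_v, best_cost = (dy, dx), cost
    return best_v, best_cost
```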

2.1.2. Motion Classification of Feature Blocks

Given the motion vectors of a selected set of features, a principal-component clustering algorithm [6] is applied to group these features into object clusters. Note that these features can be blocks, edges, or any image structures extracted by the tracking scheme. The lower level motion vectors are grouped into higher level objects for shape modeling. Let K denote the number of feature blocks selected by the tracker and let F be the number of frames over which these feature blocks are tracked. The motion vectors can be stored in the feature displacement matrix D,

D = \begin{bmatrix} d(1, 1) & d(1, 2) & \cdots & d(1, K) \\ \vdots & \vdots & & \vdots \\ d(F - 1, 1) & d(F - 1, 2) & \cdots & d(F - 1, K) \end{bmatrix},   (3)

where

d(f, p) = \begin{bmatrix} x(f + 1, p) - x(f, p) \\ y(f + 1, p) - y(f, p) \end{bmatrix}  for p = 1, \ldots, K and f = 1, \ldots, F - 1,

where (x(f, k), y(f, k)) is the location of block B_k in frame f. The spatial locality of the feature blocks is stored in the feature measurement matrix W,

W = [w(1, 1) \;\; w(1, 2) \;\; \cdots \;\; w(1, K)],   (4)

where

w(1, p) = \begin{bmatrix} x(1, p) \\ y(1, p) \end{bmatrix}  for p = 1, \ldots, K.

The motion feature for motion classification is defined as

V^{+} = \begin{bmatrix} v_{w1} \\ v_{d1} \end{bmatrix} = [v(1) \;\; v(2) \;\; \cdots \;\; v(K)],   (5)

where v_{w1} and v_{d1} denote the principal components (or first right singular vectors) of the matrices W and D, respectively. Note that V^{+} is a 2 × K matrix and each column vector v(i) is called the motion feature of the i-th feature block. It has been shown in [6] that the PC-based motion features form distinctive motion clusters in the two-dimensional motion feature space, assuming that each individual object has a different motion trajectory in the scene and a sufficient number of feature blocks are selected from each object of interest. The distribution of motion features can be modeled by a mixture of Gaussian classes with i.i.d. samples. Therefore,

p(v(i)) = \sum_{r=1}^{K} P(\Theta_r)\, p(v(i) \mid \Theta_r),   (6)

where \Theta_r represents the r-th cluster and P(\Theta_r) denotes the prior probability of cluster r. By definition, \sum_{r=1}^{K} P(\Theta_r) = 1. The component density p(v(i) \mid \Theta_r) is a Gaussian distribution,

p(v(i) \mid \Theta_r) \equiv N(\mu_r, \Sigma_r) = \frac{\exp\left[-\frac{1}{2}(v(i) - \mu_r)^T \Sigma_r^{-1} (v(i) - \mu_r)\right]}{(2\pi)^{d/2} |\Sigma_r|^{1/2}},   (7)

where \mu_r and \Sigma_r denote the mean vector and the full-rank covariance matrix of the r-th cluster, respectively. The dimensionality d = 2 in our case.

The objective is to maximize the complete log-likelihood function of the data set V^{+} with respect to the parameter set w \equiv \{E(z_r(i)), P(\Theta_r), \mu_r, \Sigma_r\},

l_c(w; V^{+}) = \sum_{i=1}^{N} \sum_{r} z_r(i) \log[P(\Theta_r)\, p(v(i) \mid \Theta_r)],   (8)

where z_r(i) is the indicator variable which takes binary values (1 or 0): z_r(i) = 1 iff pattern i is generated by cluster r.

We apply the EM algorithm [23], which is a special kind of quasi-Newton algorithm with a search direction having a positive projection on the gradient of the log-likelihood. Each EM iteration consists of two steps: an expectation (E) step and a maximization (M) step. The M-step maximizes a likelihood function which is refined in each iteration by the E-step.


At iteration j, the E-step takes the expectation of the complete-data likelihood,

Q(w, w^{(j)}) = E\left[l_c(w; V^{+}) \mid V^{+}, w^{(j)}\right]
= E^{(j)}\left[\sum_{i=1}^{N} \sum_{r} z_r(i)\left[\log P(\Theta_r) + \log p(v(i) \mid \Theta_r)\right]\right]
= \sum_{i=1}^{N} \sum_{r} h_r^{(j)}(i)\left[\log P(\Theta_r) + \log p(v(i) \mid \Theta_r)\right],   (9)

where we define p^{(j)}(v(i) \mid \Theta_r) \equiv N(\mu_r^{(j)}, \Sigma_r^{(j)}) and

h_r^{(j)}(i) \equiv E^{(j)}[z_r(i)]
= P^{(j)}(z_r(i) = 1 \mid v(i))
= \frac{P^{(j)}(z_r(i) = 1)\, p^{(j)}(v(i) \mid z_r(i) = 1)}{p^{(j)}(v(i))}
= \frac{P^{(j)}(\Theta_r)\, p^{(j)}(v(i) \mid \Theta_r)}{\sum_{k} P^{(j)}(\Theta_k)\, p^{(j)}(v(i) \mid \Theta_k)}   (10)
= \frac{p^{(j)}(v(i) \mid \Theta_r)}{\sum_{k} p^{(j)}(v(i) \mid \Theta_k)},   (11)

assuming all the priors are equal. The meaning of h_r(i) is that it is the probability for pattern i to be generated from cluster r. After the E-step, in the M-step of the EM algorithm we maximize Q(w, w^{(j)}) with respect to w and have

P^{(j+1)}(\Theta_r) = \frac{1}{N} \sum_{i=1}^{N} h_r^{(j)}(i)

\mu_r^{(j+1)} = \frac{\sum_{i=1}^{N} h_r^{(j)}(i)\, v(i)}{\sum_{i=1}^{N} h_r^{(j)}(i)}

\Sigma_r^{(j+1)} = \frac{\sum_{i=1}^{N} h_r^{(j)}(i)\, \left[v(i) - \mu_r^{(j)}\right]\left[v(i) - \mu_r^{(j)}\right]^T}{\sum_{i=1}^{N} h_r^{(j)}(i)}.   (12)

When the EM iteration converges, it should ideally obtain the maximum likelihood estimate (MLE) of the data distribution. EM has been reported to deliver excellent performance in several data-clustering problems. Using EM for clustering the motion features also provides a "soft pruning" mechanism, because motion features with low class-conditional probability have less influence on the updating of the class parameters.
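A compact sketch of Eqs. (5)–(12), assuming the displacement matrix D and measurement matrix W are available as numpy arrays; priors are kept equal and fixed as in Eq. (11), and the initialization and iteration count are illustrative choices, not those of [6]:

```python
import numpy as np

def motion_features(W, D):
    """Eq. (5): stack the first right singular vectors of W and D
    into the 2 x K motion feature matrix V+."""
    vw1 = np.linalg.svd(W)[2][0]   # first right singular vector of W
    vd1 = np.linalg.svd(D)[2][0]   # first right singular vector of D
    return np.vstack([vw1, vd1])   # column i = motion feature v(i)

def em_cluster(V, K, iters=50):
    """EM for a K-component Gaussian mixture over the columns of V
    (2 x N), following Eqs. (7)-(12) with equal priors as in Eq. (11)."""
    d, N = V.shape
    rng = np.random.default_rng(0)
    mu = V[:, rng.choice(N, K, replace=False)].copy()   # random init
    Sigma = np.array([np.cov(V) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step: posterior h_r(i) per Eq. (11).
        h = np.empty((K, N))
        for r in range(K):
            diff = V - mu[:, [r]]
            quad = np.sum(diff * (np.linalg.inv(Sigma[r]) @ diff), axis=0)
            h[r] = (np.exp(-0.5 * quad)
                    / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[r])))
        h /= h.sum(axis=0, keepdims=True) + 1e-300
        # M-step: mean and covariance updates per Eq. (12).
        for r in range(K):
            nr = h[r].sum()
            mu[:, r] = (V * h[r]).sum(axis=1) / nr
            diff = V - mu[:, [r]]
            Sigma[r] = (h[r] * diff) @ diff.T / nr + 1e-6 * np.eye(d)
    return h.argmax(axis=0), mu, Sigma   # hard labels per feature block
```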

2.1.3. Initial Condition

To determine the number of object clusters and initialize the classification process, VQ or K-means can be used. Both VQ and K-means are commonly used unsupervised clustering techniques. One limitation of the K-means algorithm is that the number of clusters needs to be decided beforehand. VQ, on the other hand, does not assume that the number of clusters is known, but it assumes a distance threshold between different classes. In practice, it is often difficult to decide the total number of classes, K, automatically and solely from the data distribution. One possible method for deciding K is to calculate the following energy function and choose the K resulting in the lowest energy,

E = -\sum_{i=1}^{N} \log\left[\sum_{r} p(v(i) \mid \Theta_r)\right] + K E_p,   (13)

where E_p is the incurred energy penalizing an increase of K. This is similar to the MDL (minimum description length) principle [24]. In fact, this operation simply shifts the burden of deciding K to determining the penalty term E_p. Simulations show that deciding K, the number of classes, based on Eq. (13) does not necessarily find the actual number of major motion classes. In the experiments we have conducted, K is regarded as known.
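Eq. (13) is straightforward to evaluate once the mixture has been fitted for each candidate K; a hedged sketch reusing the em_cluster routine sketched above, with E_p a user-chosen penalty (both the routine and the penalty value are illustrative, not from the paper):

```python
import numpy as np

def select_K(V, K_range, E_p, em_cluster_fn):
    """Evaluate the energy of Eq. (13) for each candidate K and return
    the K with the lowest energy. em_cluster_fn returns (labels, mu,
    Sigma) as in the EM sketch above; equal priors are assumed."""
    d, N = V.shape
    energies = {}
    for K in K_range:
        _, mu, Sigma = em_cluster_fn(V, K)
        dens = np.zeros(N)
        for r in range(K):
            diff = V - mu[:, [r]]
            quad = np.sum(diff * (np.linalg.inv(Sigma[r]) @ diff), axis=0)
            dens += (np.exp(-0.5 * quad)
                     / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[r])))
        energies[K] = -np.log(dens).sum() + K * E_p  # Eq. (13)
    return min(energies, key=energies.get)
```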

Performing the EM algorithm to cluster motion features does not require assuming any specific motion model; the clustering is determined solely by the distribution of the motion features. When a specific motion model is assumed to suitably describe the object motion, a generalized EM (GEM) algorithm can be applied, and the motion parameters and the segmentation can be estimated alternately. The principal-component-based scheme has been shown to successfully separate multiple moving objects in a scene, or objects at different depths under a moving camera, provided the motion tracking offers accurate feature correspondences for the foreground moving objects. It performs well when each individual moving object has a distinctive motion trajectory or the objects are spatially separate in their 2D projections, since this assumption implies a larger inter-class distance in the motion feature space. For a nonrigid object, however, this algorithm has very limited capability, since its algebraic basis assumes a rigid object structure.

Figure 1 shows the feature block tracking and the motion classification results for a sequence containing three moving coffee cans.

FIG. 1. (a) Tracking of coffee cans over three frames; (b) classification of feature blocks into three different groups. "Coca-Cola" is a trademark of The Coca-Cola Company.


2.2. 3D Motion and Shape Reconstruction from Feature Point Correspondence

After feature blocks are classified into different motion classes, the feature points from the same class can be used to reconstruct the 3D motion and shape structure for that object. An object's 3D structure can be specified by the 3D locations of its feature points. Let S_0 denote this structure:

S_0 = \begin{bmatrix} X_1 & X_2 & \cdots & X_N \\ Y_1 & Y_2 & \cdots & Y_N \\ Z_1 & Z_2 & \cdots & Z_N \end{bmatrix}.   (14)

Object motion can be described by a rotation matrix R (orthonormal, 3 × 3) and a translation vector T = [t_x, t_y, t_z]^T. At frame t, the locations of the feature points on object S_0 can be represented as S(t) = R(t) S_0 + T(t) E (E is a row vector of all ones). The projection of a feature point can be written as P(S(t)) = P(R(t) S_0 + T(t) E), where P is the projection that maps a 3D point (a column in S(t)) to a 2D image point.

The problem of 3D motion and shape reconstruction from feature point correspondence is defined as follows. Given the projections of a group of feature points through F frames, we want to find the S_0, R(t), and T(t) that generate the trajectories of the feature points in the images. A para-perspective model (cf. Fig. 2) is used to approximate the perspective projection so that the solution can be obtained by solving a linear system. This model "bends" the projection lines to be parallel with the line passing through the object center and the focus. It approximates perspective projection better as the ratio of the object size to its distance from the camera becomes smaller.

The factorization method of Poelman and Kanade [11] is then used to reconstruct the 3D motion and shape structure of the object. It estimates the object rotation (or orientation) R(t), the object translation (or camera-centered location) T(t), and the object shape S_0 (represented as object-centered). The reconstructed shape is determined only up to a scale factor. This does not affect our shape representation using implicit polynomials, since scale invariants can be used and the relative scale is sufficient for matching multiple objects in the scene.
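To make the factorization idea concrete, here is a sketch of the simpler orthographic variant [7]; the para-perspective method of [11] follows the same rank-3 factorization pattern but adds scaled registration and different metric constraints, which are omitted here:

```python
import numpy as np

def factorize_orthographic(U_img, V_img):
    """Tomasi-Kanade-style rank-3 factorization (orthographic sketch).
    U_img, V_img: F x N arrays with the tracked x- and y-coordinates of
    N feature points over F frames. Returns a motion matrix M (2F x 3)
    and an object-centered shape S (3 x N), each recovered only up to
    an invertible 3 x 3 ambiguity (and overall scale)."""
    # Register measurements to the per-frame centroid (removes translation).
    Wm = np.vstack([U_img - U_img.mean(axis=1, keepdims=True),
                    V_img - V_img.mean(axis=1, keepdims=True)])
    # Rank-3 approximation via SVD: Wm ~ M @ S.
    U, s, Vt = np.linalg.svd(Wm, full_matrices=False)
    M = U[:, :3] * np.sqrt(s[:3])
    S = np.sqrt(s[:3])[:, None] * Vt[:3]
    # The metric upgrade (solving for Q so that the rows of M @ Q satisfy
    # per-frame orthonormality constraints) is omitted in this sketch.
    return M, S
```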

FIG. 2. An illustration of para-perspective projection in two directions. Here dotted lines represent perspective lines while boldface lines denote parallel lines.


2.3. Geometric Shape Representation: Implicit Polynomials and Invariants

Geometric shape is a key ingredient of any real, practical, versatile pictorial database system, yet little has been done in this regard. In this paper, we explore the potential of geometric shape representations for object structure estimation from video sequences. In the geometric shape modeling domain, B-splines are among the most successful and widely used representations. However, they are not our best choice here: the geometric model that we use should facilitate efficient indexing and querying of the database. Although B-splines can represent geometric shape structures accurately and are very useful for graphics, their effectiveness in dealing with missing data, matching of approximately similar shapes, and unknown coordinate transformations, at low computational cost, has not been established.

Implicit polynomial (IP) models, i.e., algebraic curves and surfaces, have been successfully applied to many object shape modeling and recognition problems [19, 14, 20, 12, 13, 17, 18]. That is because of their interpolation properties, their Euclidean and affine invariants (which permit pose-invariant, low computational cost recognition of shapes), and robust Bayesian recognizers. An implicit polynomial of degree n in 3D is a polynomial function f(x, y, z) = 0, where

f(x, y, z) = \sum_{i, j, k \ge 0,\; i + j + k \le n} a_{ijk}\, x^i y^j z^k;

here the a_{ijk} are the coefficients. An implicit polynomial function f(x, y, z) = 0 is a representation of a shape (object) S = \{(x_l, y_l, z_l) \mid l = 1, \ldots, N\} if every point of the shape S is on the zero set of the implicit polynomial, Z(f) = \{(x, y, z) \mid f(x, y, z) = 0\}.

It is customary to use polynomial fitting procedures to obtain the best implicit polynomial representation, the one that minimizes the mean squared distance (error) from the given data set to the zero set of the implicit polynomial [19]. If the first-order distance approximation d(z_i, Z(f)) = |f(z_i)| / \|\nabla f(z_i)\| is used, the mean square distance becomes [18]

d^2 = \frac{1}{N} \sum_{i=1}^{N} d^2(z_i, Z(f)) = \frac{1}{N} \sum_{i=1}^{N} \frac{|f(z_i)|^2}{\|\nabla f(z_i)\|^2}.   (15)

This is a nonlinear optimization problem. Recent results on the implicit polynomial fitting problem [19, 20], three-level implicit polynomial (3L) fitting and linear programming (LP) fitting, have opened up new ways to do more robust and efficient high-degree implicit polynomial fits to given data sets.
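For orientation, a minimal linear-algebraic sketch of IP fitting, assuming we minimize the plain algebraic error \sum_i f(z_i)^2 subject to a unit-norm coefficient vector rather than the gradient-weighted distance of Eq. (15) or the 3L/LP schemes of [19, 20]; the monomial ordering is an arbitrary illustrative choice:

```python
import numpy as np
from itertools import product

def monomials(points, n):
    """Design matrix whose columns are x^i y^j z^k with i + j + k <= n,
    evaluated at the given N x 3 data points."""
    exps = [(i, j, k) for i, j, k in product(range(n + 1), repeat=3)
            if i + j + k <= n]
    x, y, z = points.T
    return np.column_stack([x**i * y**j * z**k for i, j, k in exps]), exps

def fit_ip(points, n=4):
    """Algebraic least-squares fit: minimize sum_i f(z_i)^2 subject to
    ||a|| = 1; the minimizer is the right singular vector associated
    with the smallest singular value of the design matrix. Requires at
    least as many points as monomials for a meaningful fit."""
    A, exps = monomials(points, n)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[-1], exps  # coefficient vector (a_ijk) and exponents
```

The gradient-weighted error of Eq. (15) can then be reduced by nonlinear refinement starting from such an algebraic solution.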

Another property of the implicit polynomial shape representation is that it can represent multiple objects in a scene at the same time. Assume that there are two objects S_1 and S_2 and that their algebraic shape models are f_1(x, y, z) = 0 and f_2(x, y, z) = 0, respectively. Then the implicit polynomial model f_1 f_2(x, y, z) = 0 represents both objects. This feature is very useful when there is uncertainty in the segmentation of objects; however, it also increases the complexity of the implicit polynomial model. In our application, as discussed earlier, we use object motion information to reconstruct each individual object's shape structure and then obtain its implicit polynomial model, so this is not a problem here.

Once the IP models are obtained for the object shapes, we can turn the problem of object shape comparison into the problem of IP model comparison. However, because of the unknown transformations of the coordinate system, we cannot compare the coefficient vectors of those IP models directly. We use invariants of the IP models instead. Algebraic invariants are functions of the polynomial coefficients which remain constant with respect to a certain group of coordinate system transformations. They are invariant shape features and do not depend on the underlying coordinate system.

Let A be a transformation of the coordinate system, let α be the coefficient vector of the object IP model in the old coordinate system, and let α′ be the coefficient vector in the new coordinate system. If s(α) satisfies s(α′) = |A|^w · s(α), then s is called a relative invariant of weight w. If w = 0, then s is an absolute invariant [17, 14, 15]. An example Euclidean relative invariant of weight 4 is [17]

2 a_{121}^2 a_{202} - 6 a_{112} a_{130} a_{202} + 12 a_{040} a_{202}^2 - a_{112} a_{121} a_{211} + 9 a_{103} a_{130} a_{211} - 6 a_{031} a_{202} a_{211} + 2 a_{022} a_{211}^2 + 2 a_{112}^2 a_{220} - 6 a_{103} a_{121} a_{220} + 4 a_{022} a_{202} a_{220} - 6 a_{013} a_{211} a_{220} + 12 a_{004} a_{220}^2 - 36 a_{040} a_{103} a_{301} + 9 a_{031} a_{112} a_{301} - 6 a_{022} a_{121} a_{301} + 9 a_{013} a_{130} a_{301} + 9 a_{031} a_{103} a_{310} - 6 a_{022} a_{112} a_{310} + 9 a_{013} a_{121} a_{310} - 36 a_{004} a_{130} a_{310} + 12 a_{022}^2 a_{400} - 36 a_{013} a_{031} a_{400} + 144 a_{004} a_{040} a_{400}.

An absolute invariant can be obtained as the ratio of two relative invariants of the same weight.

We use IP shape models for the Bayesian object recognition problem. Let p(Z | α_l) denote the probability of the data set Z given an object modeled by an implicit polynomial with coefficient vector α_l. In the simplest Bayesian recognition scenario, where there is a set of L objects, labeled l = 1, 2, \ldots, L, each modeled by a single polynomial with coefficient vector α_l, and a data set Z = \{Z_1, Z_2, \ldots, Z_N\}, the minimum-probability-of-error recognition of the object type is to choose the l for which p(Z | α_l) is maximum. This, however, requires considerable computation, because the raw data Z is processed a total of L times, once for each l. Using an asymptotic approximation [17] we have a computationally attractive recognition rule: "choose the l for which (16) is maximum":

p(Z \mid \alpha_l) \approx p(Z \mid \hat{\alpha}_N) \exp\left\{-\frac{1}{2}(\alpha_l - \hat{\alpha}_N)^t \Psi_N (\alpha_l - \hat{\alpha}_N)\right\}.   (16)

Here \hat{\alpha}_N is the maximum likelihood estimate of α; it is the coefficient vector of the best polynomial fit to the data. \Psi_N is an information matrix that depends on the specific data set Z being recognized. Since p(Z \mid \hat{\alpha}_N) is the same for all l, maximizing the function in (16) is equivalent to minimizing the quadratic form (\alpha_l - \hat{\alpha}_N)^t \Psi_N (\alpha_l - \hat{\alpha}_N).

This minimum Mahalanobis distance recognition rule has low computational cost, since the data set Z is involved just once, to compute \hat{\alpha}_N and the information matrix \Psi_N. It gives approximately the same result as p(Z \mid \alpha_l) when the number of data points N is at least a few times larger than the number of polynomial coefficients. To extend to recognition based on invariants, let G_l be the vector of invariants stored in the database and let \hat{G} be the vector of invariants of the best-fitting polynomial to the data; then (16) can be extended so that the recognition rule becomes "choose the l for which (17) is maximum [17],"

p(Z \mid G_l) \approx p(Z \mid \hat{G}) \exp\left\{-\frac{1}{2}(G_l - \hat{G})^t \Psi_G (G_l - \hat{G})\right\},   (17)

where \Psi_G is computed from \Psi_N and the function G(\hat{\alpha}_N). Hence we can use the Mahalanobis distance between two invariant vectors to measure the similarity between two object shapes.
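In operational terms, the rule (17) is a nearest-neighbor search under the Mahalanobis metric; a small sketch assuming the database invariant vectors G_l and the information matrix Ψ_G are already available as numpy arrays:

```python
import numpy as np

def mahalanobis2(G_l, G_hat, Psi_G):
    """Quadratic form (G_l - G)^t Psi_G (G_l - G) from Eq. (17)."""
    d = G_l - G_hat
    return float(d @ Psi_G @ d)

def recognize(G_hat, Psi_G, database):
    """Choose the stored label whose invariant vector minimizes the
    Mahalanobis distance, i.e., maximizes the approximation (17).
    database: mapping label -> invariant vector G_l."""
    return min(database,
               key=lambda l: mahalanobis2(database[l], G_hat, Psi_G))
```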

We have tested the recognition engine on a group of shapes and their reconstructions from synthetic video sequences (Fig. 3); the results are tabulated in Fig. 4.


FIG. 3. Four object shapes: bulb, cube, egg, and heart (first row) and their corresponding reconstructed shapes from the synthetic video sequences (second row).

From the table (Fig. 4) we can see that, in general, the similarity value is smaller for the corresponding reconstructed shape.

2.4. IP Model Building from the Video Sequence

In this section, we introduce a method for obtaining implicit polynomial shape models from video sequences. The method, named the feature points reconstruction method, reconstructs object structure by estimating the 3D locations of a set of feature points on the object.

We assume that there exists a set of detectable feature points on the object surface. These feature points are tracked throughout a video sequence to establish point correspondence by the tracking algorithm introduced in Section 2.1.1. The tracked feature points are classified into different object groups by applying the principal component based clustering algorithm described in Section 2.1.2. We assume that the objects are rigid and relatively far from the camera so that perspective distortion can be ignored. As mentioned in Section 2.2, under the para-perspective projection model, the 3D motion and structure of each object are estimated using a factorization method. The structure of these feature points, in terms of their 3D locations, can then be used in the implicit polynomial fitting procedure to obtain the IP model of the object. Figure 5 shows the process of the feature points reconstruction method. In Figs. 6 and 7, the reconstruction results are shown for the mug and doll video sequences. Figure 7 also shows the IP fitting for the reconstructed shape structure.

FIG. 4. Similarity values between the shapes in Fig. 3 and their corresponding reconstructed shapes from synthetic video sequences. Values are normalized by the similarity value of the correct match.


FIG. 5. Object segmentation and generation of implicit polynomial representation.

Depending on the location and motion of the object relative to the camera, not all the feature points on the object surface can be tracked through all the video frames. It is necessary to combine the 3D reconstruction results from several different views of the object so that different portions of the object surface can be covered. Since the shapes estimated from different sets of video frames have different reference coordinates, we need to find the proper coordinate transformations to register the reconstructed shapes from the different sets of frames. This registration relies on the common feature points shared by different sets of frames.

For example, let S_1 and S_2 denote the shapes reconstructed using the tracking data from frames 1 to 3 and 4 to 6, respectively, and let S'_1 and S'_2 denote the common feature points shared by S_1 and S_2, extracted from the matrices S_1 and S_2, respectively. In order to register S_1 and S_2, we need to find the scaling factor a, the rotation matrix R, and the translation vector T, so that the error metric

\|S'_1 - (a R S'_2 + T E)\|^2

is minimized. After registering all pairs of shape estimates, our final goal is to minimize the model representation error:

\mathrm{err} = \sum_{i=1}^{F} \sum_{j=1}^{N_i} \rho(p_j^i)\, f^2(G_{N_i N_1}\, p_j^i).

Here F is the total number of frames, N_i is the number of feature points in frame i, ρ is the weight associated with a reconstructed 3D point, and G_{N_i N_1} transforms each 3D point p_j^i reconstructed from frame i from its coordinate system into the coordinate system of frame 1.

The feature points reconstruction method needs to track the feature points on the object surface, which must have a rich texture structure or distinctive feature points. Another problem associated with this method is that reconstruction errors in the three directions are treated equally: they are all combined into the X, Y, and Z coordinates of the reconstructed 3D points. But the reconstruction error comes mostly from the depth estimation, which can influence the estimation in the other two directions.
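The per-pair registration above is a similarity Procrustes problem with a classical closed-form SVD solution; the sketch below uses that solution under the assumption that corresponding columns of S'_1 and S'_2 are matched feature points (the paper does not specify its numerical scheme):

```python
import numpy as np

def register_similarity(S1p, S2p):
    """Find a, R, T minimizing ||S1' - (a R S2' + T E)||^2, where S1p
    and S2p are 3 x M matrices of corresponding feature points."""
    c1 = S1p.mean(axis=1, keepdims=True)
    c2 = S2p.mean(axis=1, keepdims=True)
    A, B = S1p - c1, S2p - c2                      # centered point sets
    U, s, Vt = np.linalg.svd(A @ B.T)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # det(R)=+1
    R = U @ D @ Vt                                 # optimal rotation
    a = np.trace(np.diag(s) @ D) / np.sum(B * B)   # optimal scale
    T = c1 - a * R @ c2                            # optimal translation
    return a, R, T
```

Each partial reconstruction can then be mapped into the frame-1 coordinate system by chaining such transformations before evaluating the representation error above.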

FIG. 6. Tracking of a mug over five frames and the reconstruction of 24 feature points.


FIG. 7. Tracking of a doll toy over three frames and the IP fitting for the set of reconstructed feature points.

Another approach, the projection comparison method, does not require tracking a full set of feature points. In this method, a small set of fairly reliable features (line segments or curves) is tracked to estimate the object motion and its location relative to the camera. The 3D object model is obtained by minimizing the error between the model projection along the estimated projection direction and the shape contour in the image, which can be tracked using standard deformable contour tracking methods [25–27]. We are currently working on this approach.

3. GEOMETRY-BASED INDEXING

Object shape structure is one of the most important features that one can use to query a pictorial database. Figure 8 illustrates the various modules involved in a typical pictorial database application. A query example (key frame, video clip, or sketch) is presented to the system. First, useful object shape and structure information is obtained via the data processing stage (segmentation and reconstruction; see Section 2). A geometric model (including shape models and a spatial layout model) for the given query sample is then generated based on the results of the previous stage. In our approach, we use implicit polynomial shape models.

FIG. 8. Geometric indexing of large pictorial databases.


Finally, the geometric invariant vectors and a spatial layout graph representation (e.g., 3D string) are used for similarity-based searching of image/video databases.

The need to query large relational databases stimulated research on indexing schemes and algorithms. Various indexing tree structures (e.g., R-tree, KD-tree) have been developed; they recursively divide the search space into smaller regions so that the search can be done efficiently. The idea of 2D strings for image data was first introduced by Chang et al. [28] as an iconic indexing method for pictorial databases. Database images and query images are both expressed as a pair of strings (in the two directions of the image plane), called 2D strings, to allow simple matching schemes for the queries. Later works add more structure and generalize this representation. The 2D strings method was extended to index video in [29], where the first frame is encoded as a number of sets of objects and subsequent frames are represented as a sequence of edits to the initial sets. Tree-like spatial data models, as well as more general image algebra representations, have also been introduced recently. The book by Chang and Jungert [30] has a more detailed account of recent progress in iconic image indexing.

The VisualSEEk system developed by Smith and Chang [31] segments and encodes different visual regions (color, texture) from input images or video clips and adopts a 2D string approach for fast indexing. In our application, we want to represent and encode object spatial layout and shape feature information. We simply extend a 2D string to a 3D string by adding another string sequence in the third dimension (depth) which encodes the relative object locations in this direction. Since the 3D string only encodes the object spatial information, we have to add individual object shape features into the 3D string in order to distinguish different object shape structures as well.

In Section 2, multiple objects are segmented and their motions and shapes are estimated. We generate the 3D string representation for the overall object configuration of the whole scene. For example, Fig. 9 shows the estimated configuration of the coffee-can scene in Fig. 1. The small sphere, cube, and cone represent the three coffee cans, respectively, and their relative positions are shown across three different representative frames. The 3D string corresponding to this configuration can then be written as (A < B < C, A > C > B, A = C < B). Here A < B means that object A's coordinate value is less than that of object B along a certain axis. We then encode the individual object shape information, represented as a feature vector of algebraic invariants (v) of its implicit polynomial representation, into the 3D strings so that further comparison based on object shape is possible if the configurations of two scenes are similar. For example, the complete 3D string for Fig. 1 would be

((A, v(A)) < (B, v(B)) < (C, v(C));
(A, v(A)) > (C, v(C)) > (B, v(B));
(A, v(A)) = (C, v(C)) < (B, v(B))).
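For concreteness, a small sketch of how such axis strings can be generated from estimated object centroids; the label set, the tie tolerance, and the ascending-order convention (which writes only '<' and '=', e.g. B < C < A instead of A > C > B) are illustrative choices:

```python
def axis_string(labels, coords, tol=1e-6):
    """Order object labels along one axis, joining them with '<' and
    using '=' when coordinates coincide within tol."""
    order = sorted(zip(coords, labels))
    out = [order[0][1]]
    for (c_prev, _), (c, lab) in zip(order, order[1:]):
        out.append('=' if abs(c - c_prev) < tol else '<')
        out.append(lab)
    return ''.join(out)

def string_3d(centroids):
    """centroids: mapping label -> (x, y, z). Returns one string per
    axis, e.g. ('A<B<C', 'B<C<A', 'A=C<B') for the scene in Fig. 9."""
    labels = list(centroids)
    return tuple(axis_string(labels, [centroids[l][ax] for l in labels])
                 for ax in range(3))
```

The per-object invariant feature vectors v(·) can then be attached to each label, as in the complete 3D string above.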

FIG. 9. Relative locations of the three coffee cans in Fig. 1 across three representative frames. The small sphere, cube, and cone represent the three different coffee cans.


4. CONCLUSIONS AND FUTURE WORK

Semantic video retrieval requires knowledge of the content of the video data, i.e., what objects are present in the scene and their location (spatial information) and motion (temporal information) with respect to each other. We have presented in this paper a general approach that extracts useful 3D object shape and motion information and automatically builds geometric models (implicit polynomials) from it. Algebraic invariants of the shape models can be computed and stored as feature vectors for future query processing.

Our method is still limited in many ways. With this paper we intend to propose a primitive video retrieval system, annotated by 3D object shapes and their spatial relationships. To this end, many assumptions have been made. Many video clips of natural scenes do not satisfy the rigid-body assumption, and representing and extracting deformable object motion is still an open problem; one main challenge lies in finding the feature correspondence of deformable objects. In fact, robust motion tracking is central to our approach: the 3D shape reconstruction and the registration of partial object views from multiple frames can suffer from motion tracking errors, which affect the final result. Further work therefore needs to explore the stability and robustness issues in motion tracking and model building. A powerful indexing scheme built upon the geometric models for a real video database is also crucial, but is not addressed in this paper.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the helpful suggestions and comments from the reviewers in the revision of the paper.

REFERENCES

1. H. S. Sawhney, S. Ayer, and M. Gorkani, Model-based 2D & 3D dominant motion estimation for mosaicing and video representation, in IEEE Int'l Conf. on Computer Vision, Cambridge, MA, June 1995.

2. N. Diehl, Object-oriented motion estimation and segmentation in image sequences, Signal Process.: Image Commun. 3, No. 1, 1991, 23–56.

3. M. M. Chang, M. I. Sezan, and A. M. Tekalp, An algorithm for simultaneous motion estimation and scene segmentation, in Proc. IEEE ICASSP, April 1994, Vol. V, pp. 221–223.

4. D. W. Murray and B. F. Buxton, Scene segmentation from visual motion using global optimization, IEEE Trans. Pattern Anal. Mach. Intell. 9, 1987, 161–180.

5. J. Y.-A. Wang and E. H. Adelson, Representing moving images with layers, IEEE Trans. Image Process. 3, No. 5, 1994, 625–638.

6. Y.-T. Lin, Y.-K. Chen, and S. Y. Kung, A principal component clustering approach to object-oriented motion segmentation and estimation, J. VLSI Signal Process. 17, No. 2/3, 1997, 163–187.

7. C. Tomasi and T. Kanade, Shape and motion from image streams under orthography: A factorization method, Int. J. Comput. Vision 9, No. 2, 1992, 137–154.

8. H. C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections, Nature 293, 1981, 133–135.

9. H. S. Sawhney, J. Oliensis, and A. R. Hanson, Image description and 3-D reconstruction from image trajectories of rotational motion, IEEE Trans. Pattern Anal. Mach. Intell. 15, No. 9, 1993, 885–898.

10. R. Y. Tsai and T. S. Huang, Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces, IEEE Trans. Pattern Anal. Mach. Intell. 6, No. 1, 1984.

11. C. J. Poelman and T. Kanade, A Paraperspective Factorization Method for Shape and Motion Recovery, Technical Report CMU-CS-92-208, CMU, 1992.

12. C. M. Hoffmann, Implicit curves and surfaces in CAGD, IEEE Comput. Graphics Appl., 1993.


13. D. Keren, D. B. Cooper, and J. Subrahmonia, Describing complicated objects by implicit polynomials, IEEE Trans. Pattern Anal. Mach. Intell., 1994.

14. Z. Lei, H. Civi, and D. B. Cooper, Free-form object modeling and inspection, in Proceedings, Automated Optical Inspection for Industry, SPIE's Photonics China '96, Beijing, China, November 1996.

15. Z. Lei, D. Keren, and D. B. Cooper, Computationally fast Bayesian recognition of complex objects based on mutual algebraic invariants, in Proceedings, International Conference on Image Processing, Washington, D.C., October 1995.

16. T. W. Sederberg and D. C. Anderson, Implicit representation of parametric curves and surfaces, Computer Vision, Graphics, and Image Processing 28, 1984.

17. J. Subrahmonia, D. B. Cooper, and D. Keren, Practical reliable Bayesian recognition of 2D and 3D objects using implicit polynomials and algebraic invariants, IEEE Trans. Pattern Anal. Mach. Intell., 1996, 505–519.

18. G. Taubin, Estimation of planar curves, surfaces and nonplanar space curves defined by implicit equations, with applications to edge and range image segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 1991.

19. Z. Lei, M. M. Blane, and D. B. Cooper, 3L fitting of higher degree implicit polynomials, in Proceedings, Third IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, December 1996.

20. Z. Lei and D. B. Cooper, New, faster, more controlled fitting of implicit polynomial 2D curves and 3D surfaces to data, in Proceedings, Computer Vision and Pattern Recognition Conference, San Francisco, CA, June 1996.

21. J. L. Barron, D. J. Fleet, and S. S. Beauchemin, Performance of optical flow techniques, Int. J. Comput. Vision 13, 1994, 43–77.

22. Y.-K. Chen, Y.-T. Lin, and S. Y. Kung, A feature tracking algorithm using neighborhood relaxation with multi-candidate pre-screening, in Proc. ICIP'96, Lausanne, Switzerland, September 1996.

23. A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Statist. Soc. B 39, 1977, 1–38.

24. J. Rissanen, A universal prior for integers and estimation by minimum description length, Ann. Statist. 11, No. 2, 1983, 416–431.

25. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, Int. J. Comput. Vision, 1988, 321–331.

26. C. W. Ngo, S. Chan, and K. F. Lai, Motion tracking and analysis of deformable objects by generalized active contour models, in Second Asian Conference on Computer Vision, Vol. 3, pp. 442–446, 1995.

27. F. Leymarie and M. D. Levine, Tracking deformable objects in the plane using an active contour model, IEEE Trans. Pattern Anal. Mach. Intell. 15, No. 6, 1993, 617–634.

28. S. Chang, Q. Shi, and C. Yan, Iconic indexing by 2D strings, IEEE Trans. Pattern Anal. Mach. Intell., 1987, 413–428.

29. T. Arndt and S. Chang, Image sequence compression by iconic indexing, in 1989 IEEE Workshop on Visual Languages, pp. 177–182, IEEE Computer Society, October 1989.

30. S.-K. Chang and E. Jungert, Symbolic Projection for Image Information Retrieval and Spatial Reasoning, Academic Press, San Diego, 1996.

31. J. R. Smith and S.-F. Chang, VisualSEEk: A fully automated content-based image query system, in Proceedings, ACM Multimedia Conference, Boston, MA, 1996.

ZHIBIN LEI received the B.S. degree in mathematics from Beijing University, China, in 1989, the M.S. degree in electrical engineering from Brown University, Providence, Rhode Island, in 1994, and the M.S. degree in applied mathematics and the Ph.D. degree in electrical engineering from Brown University in 1997. He was a Meritorious Award winner of a mathematical contest in modeling sponsored by SIAM in 1989 and held a CRM-UBC fellowship for the Mathematical Biology Summer School at the University of British Columbia, Vancouver, Canada, in 1993. During the summer of 1996 he was with Panasonic Technologies Inc., working on digital libraries and Web-based information retrieval. He joined Bell Laboratories, Lucent Technologies, as a member of technical staff in 1997. His research interests include machine vision and image processing, computer graphics, and multimedia information applications.

YUN-TING LIN received her B.S. degree from National Taiwan University in 1992, and her M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, New Jersey, in 1996 and 1998, respectively. Since January 1998, she has been working with Digital Video Express, L.P., Herndon, Virginia, as a research scientist. Her primary research interests include motion estimation and classification, automatic video object segmentation, MPEG-4, 3D motion/shape reconstruction and modeling, object tracking, and digital watermarking.