
Pattern Recognition Letters 3 (1985) 279-286 July 1985 North-Holland

3D object pose from clustering with multiple views

George STOCKMAN and Juan Carlos ESTEVA Michigan State University, East Lansing, MI 48824, USA

Received 18 September 1984

Revised 15 March 1985

Abstract: A candidate pose algorithm is described which computes object pose from an assumed correspondence between a pair of 2D image points and a pair of 3D model points. By computing many pose candidates, actual object pose can usually be determined by detecting a cluster in the space of all candidates. Cluster space can receive candidate pose parameters from independent computations in different camera views. It is shown that use of geometric constraint can be sufficient for reliable pose detection, but use of other knowledge, such as edge presence and type, can be easily added for increased efficiency.

Key words: Object pose, object detection, inverse perspective, pose clustering, cluster-space-stereo, geometric constraint.

1. Introduction

The use of geometric constraints from perspective imaging of known models appeared in the earliest computer vision research and persists in recent work (e.g. Roberts (1965) and Chakravarti (1982)). Typically, considerable image analysis has been done in order to correctly correspond image points to model vertices and image lines to model edges. Pose of an object on a support plane can usually be decoded from only two correctly matching (image point : object vertex) pairs via inverse perspective computations.

Work reported here shows that global image analysis is not necessary for the detection of known objects; a set of sparse local features can be sufficient. Viewed another way, it shows that the geometric constraints of perspective imaging are very powerful. Other recent work (Grimson and Lozano-Perez (1984)) in matching sparse 3D sensed data to models also confirms the power of purely geometric constraint.

A candidate pose algorithm (CPA) is given for decoding pose via inverse perspective from two hypothesized matching feature pairs. The CPA usually reports failure for arbitrarily matched feature pairs, thus showing the power of the constraints. Successful computation by the CPA yields pose candidates which are clustered to yield a globally consistent pose. Apparently a useful and robust pose detection algorithm results. Topological and illumination constraints may be easily added to increase the efficiency of the algorithm. 'Cluster-space-stereo' allows matching evidence from several different camera views to be integrated without having to perform any analysis across images. This permits multiple views and parallel computation yet avoids the 'correspondence problem' of ordinary stereo.

The major assumptions used here are appropriate to many industrial environments.

(A1) There are only a few objects possible and each has only a few stable states.

(A2) Objects are supported by a plane and cameras are all calibrated with respect to a global coordinate system on the plane.

(A3) Due to occlusion and variation in the sensors and environment, it is unlikely that all features of an object will be detected in an image.

(A4) Rigid geometrical models exist for objects and detection of object pose is equivalent to discovering the instance transformation which places the object in the global coordinate system.

2. The candidate pose algorithm

The environment assumed is shown in Figure 1.

An object model M has pose (θ, Tx, Ty) in the global space. A vision system identifies point features P1 and P2 in the image (as well as other features) but cannot be sure which model points are actually being observed. Only a few pairs of model points can image to P1 and P2, since the object is constrained to rotate and translate on the support plane and any 3D point imaging at Pi must lie along ray LPi.


Figure 1. Environmental constraints imply that few poses of the block can image points at P1 and P2.


Figure 2. Model points Q1 and Q2 cannot image to P1 and P2 in camera 1 (θ exists but scale is bad) or in camera 2 (no θ is possible). In the figure, P traces out the image of Q2 as the model rotates about axis Q1C.

Correspondences ((P1, J), (P2, B)) and ((P1, I), (P2, A)) are equally likely in this situation but some others, such as ((P1, E), (P2, A)), are impossible. Figure 2 shows two general cases where it is impossible to slide the block so that specific points Q1 and Q2 image at P1 and P2.

The geometric constraints are represented algebraically via the homogeneous instance transformation and perspective transformation. Let R = (x_m, y_m, z_m, 1) be a model point in a body centered coordinate system, Q = (x_w, y_w, z_w, 1) be the world point in the global space, and let P = (u, v, 1) be the image of Q in image coordinates (see Ballard and Brown appendices). In matrix form we get the following equations.

$$Q = RT \quad \text{(instance, or pose, transformation)}, \tag{1}$$

$$(x_w, y_w, z_w, 1) = (x_m, y_m, z_m, 1) \begin{pmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ T_x & T_y & 0 & 1 \end{pmatrix},$$

$$P = QA = (RT)A \quad \text{(perspective transformation)}, \tag{2}$$

$$(tu, tv, t) = (x_w, y_w, z_w, 1) \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{pmatrix}.$$
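To make equations (1) and (2) concrete, here is a minimal sketch of the forward transformations in Python, keeping the row-vector convention of the text. The function names and the use of numpy are our illustrative choices, not the authors' implementation; the 4x3 camera matrix A would come from calibration.

```python
# Sketch of equations (1) and (2); row-vector convention as in the text.
import numpy as np

def pose_matrix(theta, tx, ty):
    """Instance transformation T of equation (1): rotation by theta about
    the vertical axis, then translation (tx, ty) on the support plane."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c,  s, 0, 0],
                     [-s,  c, 0, 0],
                     [ 0,  0, 1, 0],
                     [tx, ty, 0, 1]], dtype=float)

def project(R, T, A):
    """Equation (2): image (u, v) of homogeneous model point
    R = (xm, ym, zm, 1) under pose T and 4x3 camera matrix A."""
    tu, tv, t = (R @ T) @ A
    return np.array([tu / t, tv / t])
```

A model vertex is thus carried into world coordinates by T and into the image by A; Section 2A inverts exactly this pipeline.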

The camera matrix A is gotten via calibration and is constant for a given camera view. Different cameras viewing the scene will have different but constant camera matrices. Now, if it is given that two certain image points P1 and P2 are the images of two certain model points Q1 and Q2, only the 3 pose parameters (θ, Tx, Ty) are unknown, yet there are 4 equations constraining their values. It is thus easy to solve for pose from P1, P2, Q1, Q2, and A when a solution exists.

Figure 3. The world coordinates (x_w, y_w, z_w) of an image point (u, v) are determined via inverse perspective.

Failure of the computation usually indicates that the image points Pi cannot correspond to the model points Qj. The four possible outcomes from the CPA are as follows.

(K0) All equations are satisfied with Ti = (θ, Tx, Ty), so output candidate pose Ti to cluster space.

(K1) A viewing accident: P1 and P2 are too close in the image to derive an accurate θ.

(K2) Q1 and Q2 are vertical in the model: θ cannot be determined.

(K3) Computations break down (Figure 2): it is impossible to image Q1 at P1 and Q2 at P2.

Mathematical computations of the CPA are given below.

A. Inverse perspective

It is shown how the world coordinates (x_w, y_w, z_w) of a point Q observed as image point P = (u, v) are computed, assuming that the camera matrix is known and that the model coordinates (x_m, y_m, z_m) of R are known. Because the model rests on the support plane, we have z_w = z_m. This leaves only x_w and y_w unknown, and from matrix equation (2) there are two linear equations available to determine them. From the geometry of Figure 3 it can be seen that x_w and y_w should be determined with reasonable accuracy provided that the ray from L through (u, v) makes a blunt angle φ with the plane z_w = z_m.

If φ is the downward tilt angle of the optical axis (see Figure 3) then the worst radial error Δr in the z_w plane is directly related to the standoff d and the error Δθ in locating the point P in the image:

$$\Delta r \le \frac{d \sin \Delta\theta}{\sin \phi}. \tag{3}$$

Assuming that φ ≥ π/4 and that Δθ is reasonably small we have

$$\Delta r \le \sqrt{2}\, d\, \Delta\theta. \tag{4}$$

In our setup we had standoff d ≈ 1000 mm, focal length f = 35 mm, and pixel size ≈ 0.04 mm. In camera calibration, our root mean square error was typically 1 pixel in both u and v. Assuming that detection of point feature (u, v) is accurate to the nearest pixel in the image plane we should expect an average error in Δθ of

$$\Delta\theta \approx \tan \Delta\theta \approx \frac{2\ \text{pixels}}{f} = \frac{0.08\ \text{mm}}{35\ \text{mm}}. \tag{5}$$

Combining this with (4) we expect an average radial error of about 3 mm in locating x_w and y_w on the z_w plane (√2 · 1000 mm · 0.08/35 ≈ 3.2 mm). The error may be worse in case detection of (u, v) is not accurate to the nearest pixel.

Figure 4. Determination of candidate pose (θ, Tx, Ty) from assumed correspondences (R1, P1) and (R2, P2).
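The two linear equations mentioned above can be written out directly. The following sketch, under the same row-vector convention and the 4x3 camera matrix A of equation (2), recovers (x_w, y_w) for a point of known height; the singularity guard is an illustrative threshold, not a value from the paper.

```python
import numpy as np

def inverse_perspective(u, v, A, zw):
    """Recover (xw, yw) of a point at known height zw from its image (u, v).
    A is the 4x3 camera matrix of equation (2), row-vector convention.
    A minimal sketch; returns None when the 2x2 system is singular."""
    # From (tu, tv, t) = (xw, yw, zw, 1) A with u = tu/t and v = tv/t,
    # eliminate t to get two linear equations in xw and yw.
    M = np.array([[A[0, 0] - u * A[0, 2], A[1, 0] - u * A[1, 2]],
                  [A[0, 1] - v * A[0, 2], A[1, 1] - v * A[1, 2]]])
    b = np.array([u * (zw * A[2, 2] + A[3, 2]) - (zw * A[2, 0] + A[3, 0]),
                  v * (zw * A[2, 2] + A[3, 2]) - (zw * A[2, 1] + A[3, 1])])
    if abs(np.linalg.det(M)) < 1e-9:     # illustrative singularity guard
        return None
    xw, yw = np.linalg.solve(M, b)
    return xw, yw
```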

B. Pose solution

Assume that P1 and P2 are the images of model points R1 and R2 respectively. Inverse perspective is used to solve for the world positions of R1 and R2. If a solution exists, then points Q1 and Q2 are determined.

The z-coordinates of Q1 and R1 are the same, similarly for Q2 and R2. The pose (θ, Tx, Ty) of the object is then determined using only the changes in the x and y coordinates between the model points and world points as shown in Figure 4.

$$\begin{aligned} \theta_1 &= \text{the direction of the projection of ray } R_1R_2 \text{ on the support plane}, \\ \theta_2 &= \text{the direction of the projection of ray } Q_1Q_2 \text{ on the support plane}, \\ \theta &= \theta_2 - \theta_1, \\ T_x &= Q_{1x} - (R_{1x} \cos\theta - R_{1y} \sin\theta), \\ T_y &= Q_{1y} - (R_{1x} \sin\theta + R_{1y} \cos\theta). \end{aligned} \tag{6}$$
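A sketch of the full CPA built from equations (6) follows, reusing the inverse_perspective() sketch of Section 2A. The K1-K3 guards mirror the outcome list of Section 2, but the specific numeric tolerances are our assumptions, not values from the paper.

```python
import numpy as np

def candidate_pose(P1, P2, R1, R2, A):
    """Candidate pose (theta, Tx, Ty) from assumed correspondences
    (R1, P1) and (R2, P2), following equations (6).  Outcome codes K0-K3
    match Section 2.  A sketch; the tolerances are illustrative."""
    if np.hypot(P1[0] - P2[0], P1[1] - P2[1]) < 2.0:        # pixels
        return 'K1', None            # viewing accident: P1, P2 too close
    dmx, dmy = R2[0] - R1[0], R2[1] - R1[1]
    if np.hypot(dmx, dmy) < 1e-6:
        return 'K2', None            # R1, R2 vertical: theta undetermined
    Q1 = inverse_perspective(P1[0], P1[1], A, R1[2])        # z_w = z_m
    Q2 = inverse_perspective(P2[0], P2[1], A, R2[2])
    if Q1 is None or Q2 is None:
        return 'K3', None            # cannot image R_i at P_i
    dwx, dwy = Q2[0] - Q1[0], Q2[1] - Q1[1]
    # rigidity: projections of R1R2 and Q1Q2 must have equal length
    if abs(np.hypot(dwx, dwy) - np.hypot(dmx, dmy)) > 5.0:  # mm, illustrative
        return 'K3', None            # scale is bad (Figure 2, camera 1)
    theta = np.arctan2(dwy, dwx) - np.arctan2(dmy, dmx)     # theta2 - theta1
    c, s = np.cos(theta), np.sin(theta)
    tx = Q1[0] - (R1[0] * c - R1[1] * s)
    ty = Q1[1] - (R1[0] * s + R1[1] * c)
    return 'K0', (theta, tx, ty)
```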

The CPA can be used to detect object pose by trying all possible feature pairings and then clustering any pose parameters gotten from successful computations. If there are N feature points in the image a potential cluster of size N(N−1)/2 will indicate correct pose parameters. We have used an O(M²) algorithm to cluster M candidate poses. For each of the M pose candidates we count how many of the other M−1 candidates are close to the original one. The distance tolerance is set from the expected error as in (3)-(5) and was (±3°, ±3 mm, ±3 mm) in our experiments.
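The O(M²) clustering step can be sketched directly; the tolerance defaults follow the (±3°, ±3 mm, ±3 mm) figure quoted above, while the angle-wrapping detail is our addition.

```python
import numpy as np

def best_cluster(candidates, tol=(np.radians(3.0), 3.0, 3.0)):
    """O(M^2) cluster detection over candidate poses (theta, tx, ty):
    each candidate counts the others within tolerance; the densest
    candidate and its neighbour count are returned.  A sketch."""
    best, best_count = None, 0
    for i, (th_i, tx_i, ty_i) in enumerate(candidates):
        count = 0
        for j, (th_j, tx_j, ty_j) in enumerate(candidates):
            if i == j:
                continue
            dth = (th_i - th_j + np.pi) % (2 * np.pi) - np.pi  # wrap angle
            if (abs(dth) <= tol[0] and abs(tx_i - tx_j) <= tol[1]
                    and abs(ty_i - ty_j) <= tol[2]):
                count += 1
        if count > best_count:
            best, best_count = (th_i, tx_i, ty_i), count
    return best, best_count
```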

3. Simulation results

Two object models were constructed as sets of vertices and edges. These were given specific poses and run through the transformations (1) and (2) to form a junction and line image as if there were no self-occlusion by opaque surfaces. 2D image points were then paired with 3D model vertices to solve for candidate pose parameters using the CPA. Both models had 12 vertices; one was the block of Figure 1 and the other was generated by a random process. Results of using the CPA are shown in Table 1.


Table 1
Results of applying the candidate pose algorithm to all pairs of image points and model points. Ki outcomes are explained in Section 2.

Object   Pose parameters        Camera parameters    Outcomes of 9504 calls       Cluster
type     θ      Tx     Ty       f      standoff      K0     K1    K2     K3       size
Block    30°    −2     3        1      10            452    0     1452   7600     61 (a)
Block    210°   −2     3        1      10            484    0     1452   7568     61 (a)
Block    18°    2      2        1      10            510    0     1452   7542     61 (a)
Block    30°    −2     3        25     1000          1450   720   1342   5992     88
Random   30°    −0.4   −0.6     0.25   5.0           287    0     1056   8161     62

(a) The predicted cluster size is 66 abstract edges − 5 vertical edges = 61 correct pose parameter sets.

The CPA was tried on all ((12 · 11)/2) · (12 · 12) = 9504 possible pairings. Only between 3% and 15% of the pairings tried resulted in feasible pose parameters, demonstrating the power of purely geometric constraints. Also, in each case there was a substantial cluster of correct pose parameters representing 6% to 20% of an otherwise sparse cluster space.

4. Experiments in real imaging environment

The mathematical formulation of the previous section easily allows information from several cameras to be merged, and several cameras may be required to adequately view some objects. Since the CPA outputs candidate poses in global coordinates and since all cameras can be calibrated to the same global frame, a single cluster space can receive pose evidence from smart sensing systems working in parallel. Reviewing the mathematical constraints, one can see that it is even possible to assume a (P1, R1) correspondence from one image and a (P2, R2) correspondence from another image as long as the correct camera matrix is used in computing the consequent locations of Q1 and Q2 in the world. In the experiments reported below, points P1 and P2 were always from the same image. It is important to note that no correspondences are assumed between points of different images and thus the 'correspondence problem' of ordinary stereo is not encountered.
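In code, cluster-space stereo amounts to pooling K0 outputs from each calibrated view into one list before clustering. The sketch below, with hypothetical container names, makes explicit that no feature in one image is ever compared with a feature in another; it reuses candidate_pose() and would feed best_cluster() from Section 2.

```python
# camera_views: list of (A, image_points) pairs, one per calibrated camera.
# model_points: list of 3D model vertices (xm, ym, zm).  Names are assumptions.
from itertools import combinations

def pooled_candidates(camera_views, model_points):
    candidates = []
    for A, image_points in camera_views:         # each view processed alone
        for P1, P2 in combinations(image_points, 2):
            for R1 in model_points:
                for R2 in model_points:
                    if R1 is R2:
                        continue
                    code, pose = candidate_pose(P1, P2, R1, R2, A)
                    if code == 'K0':
                        candidates.append(pose)  # global coordinates
    return candidates                            # cluster with best_cluster()
```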

Three cluster-space-stereo experiments were run on our lab bench. Each object was viewed obliquely from above from two separately calibrated camera positions. One object was the block of Figure 1, another was a soft drink can, and a third was a common box from a grocery store. Feature points used were either vertices or easily distinguished markings. For the coke can no 3D shape feature points were used; only points of 2D color contrast were used, such as the center of the letter 'O' or the corner of a red stripe on a light background. Image points were human selected from a CRT and then input to the CPA along with the object models. The results given in Table 2 show the same patterns as our simulations.

The first section of the table shows good results achieved using all feature points visible to the human in the images. Sections 2 and 3 show weaker results as some of the feature points were randomly deleted to partially simulate real feature detector performance. We did not simulate false alarms in feature detection. In all cases, the use of two cameras was essential to get enough feature points for matching. In all 21 trials reported, the number of outcomes of (K0) from the CPA was between 3% and 5% of the total pairings tried. The coke can, easy to recognize by other means, proved to be difficult to handle with this approach because few feature points were visible and all but one were on the vertical cylinder. When a good cluster existed, accuracy of the computed pose was in agreement with the error analysis given in Section 2. (In the first case of the coke can it is possible that the result of clustering is better than the ground truth!)

Table 2
Results of the cluster-space-stereo experiments with different detection probabilities. P is the probability of using any given image feature point.

Actual pose parameters:
  Block (12 pts):    Tx = 75 mm,  Ty = 80 mm,  θ = 0°
  Coke can (18 pts): Tx = 102 mm, Ty = 140 mm, θ = 250°
  Glad box (21 pts): Tx = 185 mm, Ty = 188 mm, θ = 218°

P = 1.0 (one trial)             Block     Coke can   Glad box
  # points image 1/2            6/10      6/5        10/6
  # K0 outcomes of CPA          416/5%    258/3%     939/4%
  Best cluster size             14        4          15
  Pose Tx                       75        104        186
  Pose Ty                       79        140        189
  Pose θ                        1         257        219

P = 0.8 (three trials)          Block                   Coke can                Glad box
  # points image 1/2            4/8    6/10   5/4       4/3    6/5    5/4       8/4    10/6   8/5
  # K0 outcomes of CPA          221/5% 416/5% 247/5%    105/4% 258/3% 147/3%    498/4% 939/4% 592/4%
  Best cluster size             8      14     8         2      4      2         8      15     8
  Pose Tx                       74     75     75        103    104    103       186    186    186
  Pose Ty                       77     79     80        141    140    141       190    189    190
  Pose θ                        1      1      1         258    257    258       219    219    219

P = 0.5 (three trials)          Block                   Coke can                Glad box
  # points image 1/2            2/5    5/8    5/6       2/1    5/4    5/4       5/2    8/5    6/5
  # K0 outcomes of CPA          65/4%  278/5% 164/5%    11/3%  152/3% 147/3%    172/4% 609/4% 410/4%
  Best cluster size             2      10     6         -      3      2         3      10     8
  Pose Tx                       71     74     75        -      137    103       147    186    186
  Pose Ty                       84     83     81        -      137    141       212    190    190
  Pose θ                        358    358    1         -      47     258       245    219    219

5. Discussion

The power of purely geometric constraints is evident from the experimental results. Only about 5% of arbitrarily matched point pairs survived the constraints and yielded candidate poses. Then, only a small percentage of the candidate poses clustered about correct pose parameters. Moreover, geometric information from several camera views is easy to integrate.

Clustering information is given in the tables, from which it is indicated that an algorithm for detection of the correct pose using clustering should be fairly robust. We have determined a best cluster by counting the number of other points (in pose parameter space) within a given tolerance of each point. The tolerance is known from the measurement accuracy and was 3° × 3 mm × 3 mm in our experiments with real data.

The geometric constraints are also available to a hypothesize-and-test type algorithm. Such an algorithm would sequentially attempt to verify each candidate pose by forward transforming model points using (1) and (2) and checking the image for observed points near those predicted. Several more than two points are required to confirm a candidate pose and, as Roberts had discovered, to refine the accuracy of the pose parameters. If the clustering algorithm is used instead, the center of the cluster provides the refined pose, which is much more accurate than any single pose in the cluster.
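Such a verification step might look like the following sketch, which reuses pose_matrix() and project() from Section 2; the pixel tolerance and the "several more than two" threshold are illustrative assumptions.

```python
import numpy as np

def verify_pose(pose, model_points, image_points, A, tol_px=3.0, need=4):
    """Hypothesize-and-test check of one candidate pose: forward-transform
    the model via (1) and (2) and count observed image points near the
    predictions.  Thresholds are illustrative, not the paper's values."""
    theta, tx, ty = pose
    T = pose_matrix(theta, tx, ty)
    hits = 0
    for R in model_points:                      # R = (xm, ym, zm, 1)
        p = project(np.asarray(R, dtype=float), T, A)   # predicted (u, v)
        if any(np.hypot(p[0] - u, p[1] - v) <= tol_px
               for u, v in image_points):
            hits += 1
    return hits >= need            # "several more than two" points confirm
```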

Finally, it is important to note that other constraints can easily be added to either a hypothesize-and-test or a clustering algorithm for pose detection. Non-global image processing or active illumination techniques can add type information to point features and reveal edge and face relationships which can drastically reduce the matching possibilities and yield very efficient pose detection. Thus, global image analysis, usually a very difficult task, seems neither necessary nor desirable for the environments assumed in this paper.

References

Roberts, L. (1965). Machine perception of 3D solids. In: J. Tippett et al., Eds., Optical and Electro-Optical Information Processing. MIT Press, Cambridge, MA.

Chakravarti, I. (1982). The use of characteristic views as a basis for recognition of 3D objects. Ph.D. dissertation IPL-TR-034, Image Processing Lab., RPI, Troy, NY.

Grimson, W. and T. Lozano-Perez (1984). Model-based recognition and localization from tactile data. Proc. Internat. Conf. Robotics, Atlanta, GA, March 13-15, 248-255.

Stockman, G., S. Kopstein and S. Benett (1982). Matching images to models for registration and object detection via clustering. IEEE-PAMI 4, 229-241.

Ballard, D. and C. Brown (1982). Computer Vision. Prentice-Hall, Englewood Cliffs, NJ.

Stockman, G. and J.C. Esteva (1984). Use of geometrical constraints and clustering to determine 3D object pose. Dept. of Computer Sci. TR#84-002, Michigan State University, E. Lansing, MI.

Stockman, G. and J.C. Esteva (1984). 3D object pose via cluster space stereo. Dept. of Computer Sci. TR#84-005, Michigan State University, E. Lansing, MI.
