

IEEE TRANSACTIONS ON IMAGE PROCESSING 1

The Use of Vanishing Point for the Classification of Reflections From Foreground Mask in Videos

László Havasi, Zoltán Szlávik, Member, IEEE, and Tamás Szirányi, Senior Member, IEEE

Abstract—Extraction of foreground is a basic task in surveillance video analysis. In most real cases, its performance depends heavily on the efficiency of shadow detection and on the analysis of lighting conditions and reflections caused by mirrors or other reflective surfaces. This correspondence is focused on the improvement of foreground extraction in the case of planar reflective surfaces. We show that the geometric model of a scene with a planar reflective surface reduces to the estimation of the vanishing point in the case of an auto-epipolar (skew-symmetric) fundamental matrix. The correspondences for the vanishing-point estimation are extracted from motion statistics. Knowledge of the position of the vanishing point allows us to integrate the geometric model and the motion statistics into image foreground extraction to separate foreground from reflections, and thus to achieve better performance. The experiments confirm the accuracy of the vanishing point and the improvement of the foreground image mask by removing reflected object parts.

Index Terms—Image and video processing, robust estimation, scene analysis.

I. INTRODUCTION

THE determination of the position of the vanishing point [1] (or focus of expansion, FOE [2], or mirror pole [3]) in the case of a skew-symmetric fundamental matrix is a task that has rarely been the object of investigation, especially for cases where the input is a noisy outdoor video sequence which contains a planar reflective surface within the field of view. These situations occur frequently in surveillance videos [6]–[8], and they inevitably cause problems in further image-processing steps and reduce the processing system's performance (usually, a manually defined mask is used in surveillance systems to avoid false detections; nevertheless, this simple masking technique may remove object parts). Most previous publications which have focused on the use of a mirror to accomplish the 3-D reconstruction task have done so only for an indoor scene [1], [3], [4]; moreover, most of these works have relied on hand-selected point correspondences.

A principal theoretical foundation in handling this aspect of the topic is the mirror-stereo theorem. This posits that the view of a scene containing a mirror taken with a projective camera is equivalent to a combination of two views from two projective cameras and, hence, that traditional processing methods for two-view stereo images can be applied [4]. As with most "normal" multicamera configurations, in the camera-mirror setting the transformation may be computed by using corresponding point pairs. A corresponding point pair may be defined as the knowledge of the position both of the original point and of its transformation (in our case, its reflection). For a typical outdoor image, small objects (such as people) visible in the field of view of the virtual camera have textures which are low-detailed; this is a principal reason why the extraction of correspondences is such a challenging task in this situation.

Manuscript received nulldate; revised January 14, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ercan E. Kuruoglu.

The authors are with the Distributed Events Analysis Research Group, Computer and Automation Research Institute of the Hungarian Academy of Sciences, H-1111 Budapest, Kende u. 13-17, Hungary (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIP.2009.2017137

In this correspondence, we introduce a novel exploitation of the so-called co-motion statistics (introduced in [10]) of a video sequence. In contrast to [10], here the statistics are investigated in a model-based framework. For VP estimation, we have previously given the statement of the problem in [11], together with an initial method for its solution.

The only information used is the change mask of moving objects or, more generally, the change-detection (binarized intensity-change) mask (in the literature, the motion detection task, which is more sophisticated than simple change detection, is commonly referred to as foreground extraction). This is the basic information that can be extracted from a video sequence without making any a priori assumptions about scene content.

Reflections in surveillance videos usually cause problems in image analysis [15], because they appear in the foreground mask extracted by using an adaptive background model, e.g., [16]. In turn, the inaccurate mask reduces the performance of the further image-processing steps. Consequently, techniques for the avoidance of such disturbances constitute an active current research area [15], [17]. At the end of the correspondence, we present a method which integrates the estimated geometric model and the extracted statistics to enable removal of the pixels related to reflection.

The main contributions of the correspondence are as follows.

1. Method for extraction of corresponding points from a single-view image sequence: Our correspondence-detection method, based on motion statistics, works reliably in both normal and wide-baseline camera configurations; here we demonstrate its benefit in the camera-mirror setting, where one image combines the views of the two cameras.

2. Integration of geometric model and motion statistics for foreground classification: The extracted, numerically computed co-motion statistics are useful not only for model-parameter estimation but also for classification purposes. After reference statistics have been collected, we thus have a novel and more reliable classification method for deciding whether a point is a true foreground object or just a reflection.

The assumptions we use in the correspondence are as follows.
• The camera is static in position.
• The mirror is planar.
• There is only one mirror in the image.
• The image plane is not parallel to the mirror surface; the included angle is acute.
• There is sufficient motion in the video sequence to generate reliable motion statistics.
• Nonlinear lens distortions can be neglected.

The correspondence is organized as follows. The next section gives a brief overview of the video test sequences used. Then, after the derivation of the geometric model, we discuss two important properties of the camera-mirror setting. In Section III, the correspondence-extraction approach is presented, including a brief introduction of co-motion statistics. In the second part of the correspondence, we focus on the use of correspondences for the optimization of the VP position (Section IV) and foreground classification (Section V). The experimental results and a comparison with the theoretical estimates are presented in Section VI.

A. Test Sequences

Video data for our tests was obtained in various environments, bothindoor and outdoor. No video filtering for image enhancement was

1057-7149/$25.00 © 2009 IEEE


TABLE I
TEST SEQUENCES

Fig. 1. Simple model for a reflective surface: C = (C_x, C_y, C_z)^T is the camera center and I is the plane for central projection (image plane). An arbitrary 3-D point X has a virtual point pair because of the mirror plane Π, termed X'; likewise C' (the reflection of the camera center). Consistently, the 2-D points in the image plane are x, x', and v, where v corresponds to the reflection of the camera center.

used. The test data is derived from three video sequences; Table I summarizes their main properties.

II. MODEL DESCRIPTION

Fig. 1 shows in diagrammatic form a common case of a reflective surface (e.g., a mirror), denoted by $\Pi$, which lies in the (x–y) plane (right-handed system). $\mathbf{C}$ denotes the camera center, and the image plane is denoted by $I$ (3-D points are mapped to this plane via central projection). Uppercase bold letters (e.g., $\mathbf{X}$) denote 3-D point coordinates (in vector form), the elements of which are denoted by $\mathbf{X} = (X_x, X_y, X_z)^T$. Lowercase bold letters are 2-D points (in vector form) on the camera plane. Homogeneous vectors are designated by a tilde, e.g., $\tilde{\mathbf{x}}$.

Without loss of generality, in the following relationships we assume that the original points lie on the positive side of the z axis; thus the third coordinate is positive for all original points (e.g., $X_z > 0$). In the diagram, the two angles $\alpha$ and $\beta$ are the included angles of the 3-D vectors shown in Fig. 1. In our model, the camera is a general projective camera [2]. The matrix $P$ denotes the camera, which maps world points $\mathbf{X}$ to image points $\mathbf{x}$ according to $\tilde{\mathbf{x}} = P\tilde{\mathbf{X}}$.

Furthermore, we introduce the notation that the columns of $P$ are $\mathbf{p}_i$:

$$P = [\,\mathbf{p}_1 \;\; \mathbf{p}_2 \;\; \mathbf{p}_3 \;\; \mathbf{p}_4\,] \qquad (1)$$

The effect of mirror $\Pi$ may be described by the following coordinate transformation:

$$\mathbf{X}' = D\mathbf{X} \qquad (2)$$

where $D$ is a $3 \times 3$ matrix, or a $4 \times 4$ matrix in the case of homogeneous coordinates:

$$D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{bmatrix}, \qquad \tilde{D} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3)$$

This transformation is the reflection in the (x–y) plane, which operates only on the z coordinate. Thus, the image point of the virtual point $\mathbf{X}'$ generated by the reflection is

$$\tilde{\mathbf{x}}' = P\tilde{\mathbf{X}}' = P\tilde{D}\tilde{\mathbf{X}}. \qquad (4)$$
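The reflection-then-projection chain of (2)–(4) can be checked numerically. The following is a minimal sketch; the identity camera and the sample point are invented for illustration, not taken from the paper's sequences.

```python
import numpy as np

# 4x4 homogeneous reflection in the (x-y) plane: flips only the z coordinate,
# as in Eqs. (2)-(3).
D = np.diag([1.0, 1.0, -1.0, 1.0])

def project(P, X):
    """Central projection of a homogeneous 3-D point X by a 3x4 camera P."""
    x = P @ X
    return x / x[2]

# Toy camera: identity rotation, unit focal length, centre at the origin.
P = np.hstack([np.eye(3), np.zeros((3, 1))])

X = np.array([0.5, 0.2, 3.0, 1.0])   # original point, with X_z > 0
x = project(P, X)                     # image of the original point
x_virt = project(P, D @ X)            # image of the virtual (reflected) point, Eq. (4)
```

The virtual image point is the projection of the z-flipped 3-D point, which is exactly the composition $P\tilde{D}\tilde{\mathbf{X}}$ of (4).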

Property 1: The fundamental matrix corresponding to the original image and the virtual image in a camera-mirror scene is of the form $F = [\tilde{\mathbf{v}}]_\times$, where $\tilde{\mathbf{v}}$ is the vanishing point (VP). Consequently, $F$ has 2 degrees of freedom and is identified with the VP:

$$F = [P\tilde{D}\tilde{\mathbf{C}}]_\times = [\tilde{\mathbf{v}}]_\times \qquad (5)$$

Proof: See Appendix I.

From this formula, we see that $F$ is skew-symmetric and is formed from the VP:

$$F = [\tilde{\mathbf{v}}]_\times = \begin{bmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{bmatrix} \qquad (6)$$

In the case of a skew-symmetric matrix $F$, the fundamental constraint may be transformed into a collinearity constraint: namely, that the points $\tilde{\mathbf{v}}$, $\tilde{\mathbf{x}}$, and $\tilde{\mathbf{x}}'$ lie on a common line. This follows directly from the rewritten form of the fundamental constraint, $\tilde{\mathbf{x}}'^T F \tilde{\mathbf{x}} = \tilde{\mathbf{x}}'^T (\tilde{\mathbf{v}} \times \tilde{\mathbf{x}}) = \langle \tilde{\mathbf{x}}',\, \tilde{\mathbf{v}} \times \tilde{\mathbf{x}} \rangle = 0$, where $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{x}}'$ are an arbitrary corresponding point pair ($\times$ and $\langle \cdot, \cdot \rangle$ denote the cross product and the dot product, respectively).
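The skew-symmetric form of (6) and the resulting collinearity constraint can be verified with a small numeric example; the vanishing point and the point pair below are invented, chosen to be collinear.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

v = np.array([2.0, 1.0, 1.0])        # hypothetical VP (homogeneous coordinates)
F = skew(v)                           # auto-epipolar fundamental matrix, Eq. (6)

# A corresponding pair must satisfy x'^T F x = 0, i.e. v, x, x' are collinear.
x = np.array([4.0, 2.0, 1.0])
x_prime = np.array([6.0, 3.0, 1.0])   # lies on the line through v and x
residual = x_prime @ F @ x
```

Because $F\tilde{\mathbf{x}} = \tilde{\mathbf{v}} \times \tilde{\mathbf{x}}$ is the line through the VP and the point, the residual vanishes exactly when the three points are collinear.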

Property 2: From a given corresponding point pair, the nearest point to the VP is the reflection.

Proof: See Appendix II.

The importance of this property lies in the fact that knowledge of the position of the VP makes it possible to decide whether a given point is truly the reflection of another point, i.e., whether the two form a corresponding point pair.

III. STATISTICAL EVALUATION OF VIDEO SEQUENCES

Since the determination of the model is equivalent to defining the transformation between the original and the virtual views in the camera-mirror case, this transformation may be determined by using corresponding point pairs in the two views.

A. Co-Motion Statistic

Our correspondence-detection method is based on the processing of so-called co-motion statistics [10]. These statistics have been successfully used for image registration in the case of wide-baseline camera pairs.

Briefly, the co-motion statistics of a point form a descriptor of its spatial correlations with the other image points. The algorithmic implementation is executed via the temporal integration of motion masks, which leads to an approximation of the co-motion statistics that provides useful information about the scene geometry. Here, the statistics are collected from only one camera (in contrast to the situation in [10]), which gives us immediate access to the local co-motion statistics. These statistics come from the temporal summation of the binarized motion masks; these are written $m_t(\mathbf{x})$, where $t$ is the index of the frame and the 2-D vector $\mathbf{x}$ is the point in the image. Thus


$$S(\mathbf{x}, \mathbf{r}) = \sum_t m_t(\mathbf{x})\, m_t(\mathbf{r}) \qquad (7)$$

The global motion statistic, which is the motion intensity, is formalized with the relationship

$$P_g(\mathbf{x}) = \frac{\sum_t m_t(\mathbf{x})}{N_t} \qquad (8)$$

(because of the discrete time steps, $N_t$ denotes the frame count of the processed sequence). The concurrent-motion probability of an arbitrary image point $\mathbf{r}$ with another image point $\mathbf{x}$ may be defined with the following conditional-probability formula:

$$P(\mathbf{r} \mid \mathbf{x}) = \frac{S(\mathbf{x}, \mathbf{r})}{\sum_t m_t(\mathbf{x})} \qquad (9)$$

In the implementation, the above-defined $P(\mathbf{r} \mid \mathbf{x})$, after normalization (i.e., $\sum_{\mathbf{r}} P(\mathbf{r} \mid \mathbf{x}) = 1$), is assigned to every pixel in the image. It follows that there will be local maxima (peaks) in the probability maps at positions where motion was often detected concurrently. Sample statistics can be seen in Fig. 5 in the experiments section.

B. Extraction of Corresponding Point Pairs

In the case of a visible reflective surface, two peaks are probable in the co-motion statistics defined by (9), so the PDF is modeled with a simple Gaussian mixture model (GMM) with two components:

$$P(\mathbf{r} \mid \mathbf{x}) \approx w_{\mathrm{near}}\, \eta(\mathbf{r}; \boldsymbol{\mu}_{\mathrm{near}}, \Sigma_{\mathrm{near}}) + w_{\mathrm{far}}\, \eta(\mathbf{r}; \boldsymbol{\mu}_{\mathrm{far}}, \Sigma_{\mathrm{far}}), \qquad w_{\mathrm{near}} + w_{\mathrm{far}} = 1 \qquad (10)$$

where the 2-D normal distribution is defined by

$$\eta(\mathbf{r}; \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{r}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{r}-\boldsymbol{\mu})\right).$$

After the collection of statistics (9), the model parameters can be established by using the simple EM algorithm [19] or one of its variants [20]. The statistics are thus updated online, and the GMM parameters are estimated at the end of the processing phase. The first component in (10), with weight $w_{\mathrm{near}}$, applies to points in the near vicinity of the investigated point $\mathbf{x}$, and is termed $\eta_{\mathrm{near}}$. The second component (with weight $w_{\mathrm{far}}$) describes the far concurrent movements (e.g., in the reflection); this component is denoted by $\eta_{\mathrm{far}}$. Geometrically, $\|\boldsymbol{\mu}_{\mathrm{near}} - \mathbf{x}\| \le \|\boldsymbol{\mu}_{\mathrm{far}} - \mathbf{x}\|$.
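The two-component fit of (10) can be sketched with a minimal EM loop. This is a hedged illustration, not the authors' implementation: it assumes isotropic covariances and invented synthetic cluster positions, whereas the paper estimates general 2-D covariances via [19], [20].

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic co-motion samples: a "near" cluster around the investigated point
# and a "far" cluster around its reflection (all positions are illustrative).
near = rng.normal([10.0, 10.0], 1.0, size=(300, 2))
far = rng.normal([40.0, 30.0], 2.0, size=(150, 2))
pts = np.vstack([near, far])

# Minimal EM for a 2-component isotropic GMM; a full implementation would
# estimate general covariance matrices as in Eq. (10).
mu = np.vstack([pts[0], pts[-1]])    # deterministic init: one sample from each end
sigma2 = np.array([25.0, 25.0])      # per-component isotropic variance
w = np.array([0.5, 0.5])             # mixture weights, w_near + w_far = 1
for _ in range(50):
    d2 = ((pts[:, None, :] - mu[None]) ** 2).sum(-1)          # squared distances
    resp = w * np.exp(-0.5 * d2 / sigma2) / (2 * np.pi * sigma2)
    resp /= resp.sum(axis=1, keepdims=True)                   # E-step
    nk = resp.sum(axis=0)
    mu = (resp.T @ pts) / nk[:, None]                         # M-step: means
    sigma2 = (resp * d2).sum(axis=0) / (2 * nk)               # M-step: variances
    w = nk / len(pts)
```

The component whose mean stays near the investigated point plays the role of $\eta_{\mathrm{near}}$; the other, $\eta_{\mathrm{far}}$, localizes the candidate reflection.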

IV. MODEL ESTIMATION

The determination of the VP coordinates (the model parameter) is performed as the computation of the intersection of lines defined by corresponding point pairs. There are several approaches to model estimation in camera systems; good surveys can be found in [2] and [22]. Generally, model-determination problems need some robust estimator [2] that handles the objective function as a (global) optimization task; thus there is no analytic solution [28] for such robust parameter-estimation procedures (for a survey, see [22]).

In our implementation, the parameter estimation comprises two steps. First we reduce the number of outliers, and second a nonlinear optimization is performed for the final VP estimation. The inclusion of the information extracted from the global and co-motion statistics into the objective function provides an approach which differs slightly from previous methods.

Fig. 2. Rejection of outliers for the "Shop" sequence. Only the directions corresponding to the main peak (mode) of the histogram (determined from the line directions) are used for later computations. (a) Before rejection (only 320 of the total 3566 point pairs are displayed); (c) after rejection (382 point pairs); (b), (d) the corresponding histograms of angles.

A. Outlier Rejection

Depending on the configuration of the observed scene, not every moving point will necessarily have a visible reflection. The following method excludes a considerable proportion of these outliers (points which have no reflection).

For the purpose of this discrimination, we make use of the orientation of the line determined by the two points of a corresponding point pair. In the case of inliers, these lines point to (or near to) the VP; thus the directions are clustered around a characteristic value, depending on the scene setting. This is a well-known technique for processing motion vectors in navigation tasks; for a detailed description of this technique and implementation issues, we refer to [27]. The direction histograms (before and after outlier rejection) and the filtered correspondences can be seen in Fig. 2(c). It can be seen from Fig. 2(d) that the resulting cluster of directions contains directions of lines that are not necessarily parallel but point to the VP.

This simple method also works well on indoor images (e.g., our "Ants" and "Mice" sequences), because not only the points actually at the main peak are retained in the final data set, but also the points around this peak (within a 2σ distance from the peak, which means that the method accounts for 95% of the points of the distribution).
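The direction-histogram filter above can be sketched as follows. The data is synthetic (an invented VP direction, and an assumed inlier spread `sigma` for the 2σ window); the paper's actual implementation details are in [27].

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic corresponding pairs: inlier lines share a direction towards a
# (hypothetical) VP, outlier pairs point in random directions.
n_in, n_out = 200, 60
n = n_in + n_out
base = rng.uniform(0, 100, size=(n, 2))
angles = np.where(np.arange(n) < n_in,
                  rng.normal(0.6, 0.05, n),       # clustered inlier directions
                  rng.uniform(-np.pi, np.pi, n))  # random outlier directions
ends = base + 20 * np.stack([np.cos(angles), np.sin(angles)], axis=-1)
pairs = np.stack([base, ends], axis=1)            # shape (n, 2, 2)

# Histogram of line directions; keep pairs within 2*sigma of the main mode.
d = pairs[:, 1] - pairs[:, 0]
theta = np.arctan2(d[:, 1], d[:, 0])
hist, edges = np.histogram(theta, bins=72)
mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
sigma = 0.08                                      # assumed spread of the inlier cluster
keep = np.abs(theta - mode) < 2 * sigma
inlier_pairs = pairs[keep]
```

Most of the clustered pairs survive the filter while almost all random-direction pairs are rejected, mirroring the behaviour shown in Fig. 2.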

Finally, we briefly discuss another popular optimization method which is often used for transformation optimization in camera systems, namely RANSAC (RANdom SAmple Consensus) [2]. The RANSAC algorithm partitions the data set into inliers and outliers, and also delivers an estimate of the model to be estimated. RANSAC is able to cope with a large proportion of outliers; however, it requires manual setting of thresholds. In our case, where almost all of the points are noisy, RANSAC failed to compute the VP, and because of the large number of correspondences the computation is extremely time-consuming. The optimization was unsuccessful because the small set of selected points is not enough for VP estimation in the "Shop" scene.

B. Optimization Procedure

Because the VP is computed from line intersections, the remaining outliers and the measurement errors in point coordinates cause considerable problems during parameter estimation. The most notable problem with outdoor scenes is that the nearly parallel lines through corresponding points lead to large deviations in the position of line intersections. The method introduced here utilizes the extra information extracted from the videos, the motion statistics and the large number of correspondences, for the fitting of the geometric model. The VP is the argument of the goodness-of-fit function at its global maximum:

$$\tilde{\mathbf{v}} = \arg\max_{\mathbf{v}} \sum_{\mathbf{x}} w_{\mathrm{far}}(\mathbf{x})\, \eta_{\mathrm{far}}\big(g_{\mathbf{x}}(\mathbf{v}) \mid \mathbf{x}\big) \qquad (11)$$

The function $g_{\mathbf{x}}(\mathbf{v})$ returns the 2-D position related to the largest value of the Gaussian function corresponding to $\eta_{\mathrm{far}}(\cdot \mid \mathbf{x})$ such that the points $\mathbf{v}$, $\mathbf{x}$, and $g_{\mathbf{x}}(\mathbf{v})$ are collinear:

$$g_{\mathbf{x}}(\mathbf{v}) = \arg\max_{\mathbf{p}} \left\{ \eta_{\mathrm{far}}(\mathbf{p} \mid \mathbf{x}) \;:\; \tilde{\mathbf{p}}^T (\tilde{\mathbf{v}} \times \tilde{\mathbf{x}}) = 0 \right\} \qquad (12)$$

Note that this expression has a closed-form solution [28]. Further details of the optimization can be found in [11]–[13]. In contrast to other optimization procedures [13], the objective function introduced above is general enough to be used in other geometric model-estimation problems [14]. To summarize, the proposed function comes directly from the motion statistics and incorporates the measurement errors and the relevance weights.
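Given filtered pairs, a closed-form least-squares VP can be obtained as the smallest singular vector of the stacked homogeneous lines. This is a simplified, unweighted stand-in for the statistically weighted objective of (11); the VP coordinates and the pair geometry below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

true_vp = np.array([500.0, 300.0])           # hypothetical VP used to generate data
pts = rng.uniform(0, 100, size=(50, 2))
# Each "reflection" lies on the line from its point towards the VP.
refl = pts + rng.uniform(0.2, 0.8, (50, 1)) * (true_vp - pts)

def homog(p):
    """Append a unit third coordinate to a batch of 2-D points."""
    return np.hstack([p, np.ones((len(p), 1))])

# One homogeneous line per pair: l = x~ x x'~. The VP minimizing
# sum_i (l_i^T v)^2 subject to ||v|| = 1 is the smallest right singular vector.
L = np.cross(homog(pts), homog(refl))
_, _, Vt = np.linalg.svd(L)
v = Vt[-1]
vp = v[:2] / v[2]                             # back to inhomogeneous coordinates
```

With noise-free pairs the recovered `vp` matches the generating VP; with real data, the statistical weights of (11) replace this unweighted least squares.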

The knowledge of the VP position assists in the partitioning of the GMM components, corresponding to an arbitrary co-motion statistic, into two parts: one for the original point and the other for its reflection. This is simple because, by Property 2, the nearest point to the VP is the reflection. A new notation is introduced in the following conversion of (10):

$$P(\mathbf{r} \mid \mathbf{x}) \approx \eta_{\mathrm{fg}}(\mathbf{r} \mid \mathbf{x}) + \eta_{\mathrm{refl}}(\mathbf{r} \mid \mathbf{x}) \qquad (13)$$

where each term absorbs its mixture weight, e.g., $\eta_{\mathrm{refl}}(\mathbf{r} \mid \mathbf{x}) = w_{\mathrm{refl}}(\mathbf{x})\, \eta(\mathbf{r}; \boldsymbol{\mu}_{\mathrm{refl}}(\mathbf{x}), \Sigma_{\mathrm{refl}}(\mathbf{x}))$, and the component whose mean is nearer to the VP is labeled as the reflection. Based on this repartitioning, we can define the following pair of 2-D probabilities for motion, one for the motion probability in the foreground region,

$$P_{\mathrm{fg}}(\mathbf{x}) = \sum_{\mathbf{r}} P_g(\mathbf{r})\, \eta_{\mathrm{fg}}(\mathbf{x} \mid \mathbf{r}) \qquad (14)$$

and the other for the motion probability in the reflection region,

$$P_{\mathrm{refl}}(\mathbf{x}) = \sum_{\mathbf{r}} P_g(\mathbf{r})\, \eta_{\mathrm{refl}}(\mathbf{x} \mid \mathbf{r}) \qquad (15)$$

These probabilities are illustrated in Fig. 3(c) and (d).

V. IMPROVED FOREGROUND CLASSIFICATION

In this section, we present a possible application of the determined VP and co-motion statistics. The essence of this task is the removal of those pixels from the foreground mask which correspond to reflection.

The classification method we apply is based on the Bayes decision rule [29]. We show how the geometric model and the statistics can be included in the class-conditional probability function. Consider the two classes, foreground and reflection: $\omega \in \{\omega_{\mathrm{fg}}, \omega_{\mathrm{refl}}\}$. Points in the motion mask belong to the classes with the a priori probabilities $P(\omega = \omega_{\mathrm{fg}}) = P_{\mathrm{fg}}(\mathbf{x})$ for occurrence of the foreground class, and $P(\omega = \omega_{\mathrm{refl}}) = P_{\mathrm{refl}}(\mathbf{x})$ for occurrence of the reflection class. Furthermore, the decision rule may be written in the form: assign $\mathbf{x}$ to the foreground class ($\omega = \omega_{\mathrm{fg}}$) if

$$P(\mathbf{x} \mid \omega_{\mathrm{fg}})\, P(\omega_{\mathrm{fg}}) > P(\mathbf{x} \mid \omega_{\mathrm{refl}})\, P(\omega_{\mathrm{refl}}) \qquad (16)$$

Fig. 3. Main steps of the classification process which supports the removal of reflections from the foreground mask. For details on notation, see text.

where the conditional probability of a foreground pixel given the foreground class is formulated as

$$P(\mathbf{x} \mid \omega_{\mathrm{fg}}) = \sum_{\mathbf{r}} \eta_{\mathrm{refl}}(\mathbf{r} \mid \mathbf{x})\, L_{\tilde{\mathbf{v}}}(\mathbf{r}, \mathbf{x})\, m_t(\mathbf{r}) \qquad (17)$$

This expression takes into account that the foreground pixel may have a reflection. The term $\eta_{\mathrm{refl}}$ [defined in (13)] predicts the location of the reflection in a statistical sense. The term $L_{\tilde{\mathbf{v}}}$ relates the original point with the reflection predicted by the geometric model, making the estimation consistent with the estimated geometric model, and is defined by

$$L_{\tilde{\mathbf{v}}}(\mathbf{r}, \mathbf{x}) = \begin{cases} 1, & \text{if } d\big(\mathbf{r},\, \tilde{\mathbf{v}} \times \tilde{\mathbf{x}}\big) < \varepsilon \\ 0, & \text{otherwise} \end{cases} \qquad (18)$$

where $d(\mathbf{r}, \mathbf{l})$ is the point-to-line distance. It determines a line (and its surroundings) from a given point $\mathbf{x}$ through the VP. The term $m_t$ restricts the search to image points where motion was detected at frame $t$. Accordingly, (17) is equivalent to the probability that $\mathbf{x}$ has a reflection, consistent with the estimated geometric model and the estimated motion statistics, somewhere in the image at frame $t$. Based on the above discussion, the conditional probability for the "reflection" class takes into account that the reflection is related to an original point, and is given by

$$P(\mathbf{x} \mid \omega_{\mathrm{refl}}) = \sum_{\mathbf{r}} \eta_{\mathrm{fg}}(\mathbf{r} \mid \mathbf{x})\, L_{\tilde{\mathbf{v}}}(\mathbf{r}, \mathbf{x})\, m_t(\mathbf{r}) \qquad (19)$$

Some of the 2-D probabilities and the classification results are demonstrated in Fig. 3. Note that there are some cases when we should not make any decision during classification [29] (e.g., those points where there is no reflection). In these unclassifiable cases, the products in (16) are conspicuously low. To eliminate these points, we introduce a threshold value; it was determined experimentally that $10^{-3}$ is a suitable order of magnitude for this threshold for all test sequences [29]. The classification step explained above and the probabilities used are illustrated in Fig. 3, where a moving pixel is selected in the reflection part in (b), and the pixel-level statistics are in (e) and (f). The final classification results are in (g), foreground pixels, and (h), reflection pixels.
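The decision rule of (16), together with the low-product rejection, can be sketched as a small helper. The scalar inputs stand in for the per-pixel class-conditional probabilities and priors, and the 1e-3 default reflects the order of magnitude reported for the test sequences.

```python
def classify(prior_fg, prior_refl, p_x_fg, p_x_refl, thresh=1e-3):
    """Bayes decision for one moving pixel: foreground vs. reflection, with an
    'unclassified' outcome when both products are conspicuously low."""
    s_fg = prior_fg * p_x_fg          # P(x | fg) * P(fg), left side of Eq. (16)
    s_refl = prior_refl * p_x_refl    # P(x | refl) * P(refl), right side
    if max(s_fg, s_refl) < thresh:
        return "unclassified"         # e.g., a moving point with no reflection
    return "foreground" if s_fg > s_refl else "reflection"
```

For example, `classify(0.7, 0.3, 0.4, 0.1)` keeps the pixel as foreground, while two near-zero products leave the pixel unclassified.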

VI. EXPERIMENTAL RESULTS

Our basic motion detection method [30] is founded on the background model introduced by Stauffer and Grimson; their accurate but rather time-demanding foreground detector is a good basis for further classification of the foreground mask.


TABLE II
RESULTS ON MODEL OPTIMIZATION

TABLE III
DRs AND FARs FOR THREE VIDEO SEQUENCES

A. Model Estimation

The numerical results of VP estimation are summarized in Table II. The numbers of point pairs in the processing steps and the goodness-of-fit value corresponding to the optimized VP are displayed for three different test videos. The results are illustrated in Fig. 5, where the collinearities are observable, demonstrating the accuracy achieved (because of the large coordinate values, the VPs themselves are not shown in the images). In Table II, the true VP values are based on manual extrapolation. It is important to note that, as is also demonstrated in the sample images, the apparent inaccuracy in the VP's position in the case of the "Shop" video has not caused a perceptible error in the collinearity. This is because these VPs are nearly at infinity.

The larger objects (as in the "Mice" video) probably resulted in better goodness-of-fit values because the allowable margin is larger than in the case of smaller objects such as those in the "Ants" and "Shop" sequences. In the evaluation, the accuracy of the computed VP is conspicuous for the "Ants" video; the reason is that the small and rarely moving objects generated accurate corresponding point pairs.

B. Foreground Classification

Based on the manual validation, we have found that the error rate of foreground extraction was reduced. The proposed classification evaluated only points which were detected as foreground by [31]. The performance is characterized by the measures proposed in [15]: "Detection Rate (DR)" and "False Alarm Rate (FAR)." These values are obtained as DR = TP/(TP + FN) and FAR = FP/(TP + FP), where TP is the number of correctly detected object pixels, FN the number of missed object pixels, and FP the number of reflection pixels incorrectly detected as object pixels. The DR and FAR rates for three sequences are shown in Table III.
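The two rates follow directly from the pixel counts; a small helper (illustrative only, using the definitions of [15]) is:

```python
def detection_rates(tp, fn, fp):
    """DR and FAR as defined in [15]:
    DR = TP / (TP + FN), FAR = FP / (TP + FP)."""
    dr = tp / (tp + fn)
    far = fp / (tp + fp)
    return dr, far
```

For example, 90 correctly detected object pixels, 10 missed, and 10 false reflection detections give DR = 0.9 and FAR = 0.1.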

The fundamental limitation of the classification procedure is that it is usually unable to split the motion mask when the real object mask and its reflection are linked, which happens when the mirror plane and the camera plane are nearly parallel. Fig. 4(a) and (b) illustrates such situations.

C. Performance Evaluation

In the evaluation, a 3-GHz Pentium IV PC was used. The computation costs of the parts of the algorithm are as follows.

• The update of the whole statistics required 12–25 ms for each video frame; see (8) and (9).

• The correspondence extraction took 200–1500 ms; see (10).

Fig. 4. Challenging situations of foreground segmentation in scenes from the "Mice" and "Shop" sequences. In the detected motion masks (a)–(c), the object fuses with its reflection. The first row shows the input frames; the second shows the output.

• The optimization step defined by (11) was accomplished in 2–15 s. The number of correspondences affects this computation time.

The implementation of foreground classification helps us to achieve an acceptable processing time of 50–110 ms per frame.
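The optimization of (11) is not reproduced here, but its core, a goodness-of-fit function that scores a candidate VP by how collinear each corresponding point pair is with it, can be sketched (a hypothetical formulation; the paper's exact objective and its Nelder-Mead minimization [24] may differ in normalization):

```python
import numpy as np

def vp_objective(v, pairs):
    """Sum of squared, normalized collinearity residuals x'.(v x x)
    over corresponding point pairs (homogeneous 2-D points).
    The residual is zero when v, x and x' are collinear."""
    err = 0.0
    for x, xp in pairs:
        r = np.dot(xp, np.cross(v, x))
        # scale-invariant normalization across pairs
        err += (r / (np.linalg.norm(v) * np.linalg.norm(x)
                     * np.linalg.norm(xp))) ** 2
    return err
```

A generic simplex optimizer (e.g., Nelder-Mead) applied to this function over candidate VP positions would reproduce the flavor of the optimization step; the residual vanishes exactly at a VP collinear with all pairs.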

VII. CONCLUSION

We have introduced a novel application of co-motion statistics to video sequences, which allows estimation of the vanishing point in a geometrical model of a view containing a plane mirror. In conclusion, the main advantage of our method is that it does not depend on scene content and is robust in situations where manual configuration is difficult. Based on the estimated geometrical model, a simple method has been given for the improvement of the foreground-segmentation step. This postprocessing step is a possible way to remove reflections from a previously extracted foreground mask (determined by some arbitrary algorithm).

APPENDIX I
PROOF OF PROPERTY 1

We start the algebraic derivation of the model in a similar fashion to the work of Xu and Zhang [18]. The ray back-projected from the image point x by the camera P is obtained by solving PX = x (the general projective camera model). The solution is given as a 3-D line in parametric form (the ray is parametrized by the scalar λ)



Fig. 5. Summary illustrating the main steps for all the test videos.

X(λ) = P⁺x + λC    (20)

where P⁺ is the pseudo-inverse of P (i.e., PP⁺ = I) and C is the camera center. The epipolar line is the line joining the projections of the two reflected points DP⁺x and DC. These projected points are expressed by using (2), and the equation of the epipolar line is determined by the cross product

l = (PDC) × (PDP⁺x) = Fx    (21)

where F is the fundamental matrix and D is the reflection transformation. This formula defines a point-line map, thus F may be expressed by

F = [PDC]_× PDP⁺    (22)

where the notation [a]_× in general form is defined by [2] as follows:

[a]_× = [  0   −a₃   a₂
           a₃    0   −a₁
          −a₂   a₁    0  ]    (23)

Thus, the cross product is related to skew-symmetric matrices according to the equivalence [2]

a × b = [a]_× b.    (24)
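The equivalence between the cross product and the skew-symmetric operator is easy to verify numerically; a short sketch:

```python
import numpy as np

def skew(a):
    """[a]_x as in (23): the matrix such that skew(a) @ b == a x b."""
    return np.array([[0.0,  -a[2],  a[1]],
                     [a[2],  0.0,  -a[0]],
                     [-a[1], a[0],  0.0]])
```

Both the equivalence `skew(a) @ b == np.cross(a, b)` and the skew-symmetry `skew(a).T == -skew(a)` hold for any vector `a`.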

The expression of PD in terms of P is based on the camera model and (4): for a mirror plane π = (nᵀ, −d)ᵀ with unit normal n, the reflection is D = I − 2n̄πᵀ with n̄ = (nᵀ, 0)ᵀ, thus

PD = P(I − 2n̄πᵀ) = P − 2vπᵀ    (25)

where v = Pn̄ is the vanishing point of the mirror-normal direction. By substituting this formula into (22) and utilizing the definition of the camera center PC = 0 (note that the camera center is the 1-D right null-space C of P), F may be written as

F = [PDC]_× PDP⁺ = [−2(πᵀC)v]_× (I − 2vπᵀP⁺) = −2(πᵀC) [v]_× (I − 2vπᵀP⁺).    (26)

The next step employs the following relationship:

[v]_× (I − 2vπᵀP⁺) = [v]_× − 2([v]_× v)πᵀP⁺ = [v]_×    (27)

since [v]_× v = v × v = 0. Substituting this into (26), the final formula for F emerges as

F = −2(πᵀC) [v]_×.    (28)

From this formula, we see that F is skew-symmetric and is formed from the VP.

Note that the layout of F is similar to (23):

F = −2(πᵀC) [  0   −v₃   v₂
               v₃    0   −v₁
              −v₂   v₁    0  ]    (29)

In the case of a skew-symmetric matrix F, the fundamental constraint may be transformed into the collinearity constraint: namely, that the points v, x, and x′ lie on a common line. This follows directly from the rewritten form of the fundamental constraint

x′ᵀFx = −2(πᵀC) x′ᵀ[v]_× x = −2(πᵀC) x′ · (v × x) = 0    (30)

where x and x′ are an arbitrary corresponding point pair (× and · denote the cross product and the dot product, respectively). In this formula, the homogeneous forms of the vectors are used. Because the cross product of two vectors expresses the equation of the straight line through these two points, and, furthermore, the dot product is a simple substitution into this line equation, the whole expression becomes zero when the third point lies on the line defined by the two points.
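The collinearity constraint reduces to a scalar triple product, which can be checked numerically; a short sketch (illustrative, with v taken as a known VP and all points in homogeneous form):

```python
import numpy as np

def epipolar_residual(v, x, x_prime):
    """x'^T [v]_x x = x' . (v x x): zero exactly when the homogeneous
    points v, x and x' are collinear (the constraint of (30))."""
    return float(np.dot(x_prime, np.cross(v, x)))
```

For three points on a common image line the residual vanishes; moving the corresponding point off that line yields a nonzero value, which is what the goodness-of-fit evaluation in Section VI-A measures.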

APPENDIX II
PROOF OF PROPERTY 2

The statement is equivalent to

‖a + b‖ > ‖a − b‖    (31)

where ‖a‖ denotes the Euclidean length of the vector a. The simplest form can be derived by using the dot-product formula based on the angles indicated in the corresponding figure; it can be justified by some elementary linear algebra.

Note that Property 2 is reversed in the case of an obtuse angle (see assumptions).
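Assuming (31) compares the two lengths as above, the inequality follows by expanding both sides with the dot product (θ denotes the angle between a and b):

```latex
\|a+b\|^2 - \|a-b\|^2
  = (a+b)\cdot(a+b) - (a-b)\cdot(a-b)
  = 4\, a \cdot b
  = 4\,\|a\|\,\|b\|\cos\theta .
```

The right-hand side is positive for an acute angle, which gives (31), and negative for an obtuse angle, which is exactly the reversal noted above.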



REFERENCES

[1] H. Mitsumoto and S. Tamura, “3-D reconstruction using mirror images based on a plane symmetry recovering method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, pp. 941–946, 1992.

[2] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.

[3] R. Penne, “Mirror symmetry in perspective,” in Proc. ACIVS, Lecture Notes Comput. Sci., 2005, pp. 634–642.

[4] R. J. Alexandre, G. G. Medioni, and R. Waupotitsch, “Reconstructing mirror symmetric scenes from a single view using 2-view stereo geometry,” in Proc. ICPR, 2002, vol. 6, pp. 12–16.

[5] L. Lee, R. Romano, and G. Stein, “Monitoring activities from multiple video streams: Establishing a common coordinate frame,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 758–767, 2000.

[6] P. Spagnolo, T. D’Orazio, M. Leo, and A. Distante, “Advances in background updating and shadow removing for motion detection algorithms,” in Proc. CAIP, 2005, pp. 398–406.

[7] Y. Benezeth, B. Emile, H. Laurent, and C. Rosenberger, “A real time human detection system based on far infrared vision,” in Proc. CIPS, 2008, pp. 76–84.

[8] D. Greenhill, J. Renno, J. Orwell, and G. A. Jones, “Learning the semantic landscape: Embedding scene knowledge in object tracking,” Real Time Imag., vol. 11, no. 3, pp. 186–203, 2005.

[9] B. Hu, C. Brown, and R. Nelson, “Multiple-View 3-D Reconstruction Using a Mirror,” Tech. Rep., Comput. Sci. Dept., Univ. Rochester, Rochester, NY, 2005.

[10] Z. Szlávik, L. Havasi, and T. Szirányi, “Video camera registration using accumulated co-motion maps,” ISPRS J. Photogramm. Remote Sens., vol. 61, no. 1, pp. 298–306, 2007.

[11] L. Havasi and T. Szirányi, “Estimation of vanishing point in camera-mirror scenes using video,” Opt. Lett., pp. 1411–1413, 2006.

[12] L. Havasi and T. Szirányi, “Use of motion statistics for vanishing point estimation in camera-mirror scenes,” in Proc. Int. Conf. Image Process., 2006, pp. 2993–2996.

[13] Z. Szlávik, L. Havasi, and T. Szirányi, “Geometrical scene analysis using co-motion statistics,” in Proc. ACIVS, 2007, pp. 968–979.

[14] S. Keren, I. Shimshoni, and A. Tal, “Placing three-dimensional models in an uncalibrated single image of an architectural scene,” in Proc. ACM Symp. Virtual Reality Software and Technology, 2002, pp. 186–193.

[15] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts and shadows in video streams,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, pp. 1337–1342, 2003.

[16] C. Stauffer and W. E. L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 747–757, 2000.

[17] Cs. Benedek and T. Szirányi, “Bayesian foreground and shadow detection in uncertain frame rate surveillance videos,” IEEE Trans. Image Process., vol. 17, no. 4, pp. 608–621, Apr. 2008.

[18] G. Xu and Z. Zhang, Epipolar Geometry in Stereo, Motion and Object Recognition. Norwell, MA: Kluwer, 1996.

[19] R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall, 2002.

[20] F. Pernkopf and D. Bouchaffra, “Genetic-based EM algorithm for learning Gaussian mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1344–1348, 2005.

[21] W. Feller, “Laws of large numbers,” in An Introduction to Probability Theory and Its Applications, vol. 1, pp. 228–247, 1968.

[22] V. Nguyen, A. Martinelli, N. Tomatis, and R. Siegwart, “A comparison of line extraction algorithms using 2D laser rangefinder for indoor mobile robotics,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2005, pp. 1929–1934.

[23] H. Zabrodsky and D. Weinshall, “Utilizing symmetry in the reconstruction of 3-dimensional shape from noisy images,” in Proc. ECCV, 1994, pp. 403–410.

[24] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, “Convergence properties of the Nelder-Mead simplex method in low dimensions,” SIAM J. Optim., vol. 9, pp. 112–147, 1998.

[25] O. Kallenberg, Foundations of Modern Probability. New York: Springer-Verlag, 1997.

[26] Z. Szlávik and T. Szirányi, “Bayesian estimation of common areas in multi-camera systems,” in Proc. Int. Conf. Image Process., 2006, pp. 1045–1048.

[27] J. Borenstein and Y. Koren, “The vector field histogram-fast obstacle avoidance for mobile robots,” IEEE Trans. Robot. Autom., vol. 7, pp. 278–288, 1991.

[28] N. Kiryati and A. M. Bruckstein, “Heteroscedastic Hough transform (HtHT): An effective method for robust line fitting in the ‘errors in the variables’ problem,” Comput. Vis. Image Understand., vol. 78, pp. 69–83, 2000.

[29] A. Webb, Statistical Pattern Analysis. New York: Wiley, 2004.

[30] Cs. Benedek, L. Havasi, Z. Szlávik, and T. Szirányi, “Motion-based flexible camera registration,” in Proc. IEEE AVSS, 2005, pp. 439–444.

[31] D. Hall, J. Nascimento, P. Ribeiro, E. Andrade, P. Moreno, S. Pesnel, T. List, R. Emonet, R. B. Fisher, J. Santos Victor, and J. L. Crowley, “Comparison of target detection algorithms using adaptive background models,” in Proc. PETS, 2005, pp. 113–120.