

Automatic 3D object pose estimation in IR image sequences for forward motion applications

Klaus Jäger∗, Marcus Hebel, Karlheinz Bers

FGAN-FOM Forschungsinstitut für Optronik und Mustererkennung (Research Institute for Optronics and Pattern Recognition), Gutleuthausstr. 1, D-76275 Ettlingen, Germany

ABSTRACT

The successful mission of autonomous airborne systems like unmanned aerial vehicles (UAVs) strongly depends on the performance of the automatic image processing used for navigation, target acquisition and terminal homing. In this paper we propose a method for the automatic determination of missile position and orientation (pose) during target approach. The flight path is characterized by forward motion, i.e. an approximately linear motion along the optical axis of the sensor system. Due to the lack of disparity, classical methods based on stereo triangulation are not suitable for 3D object recognition. To handle this, we applied the SoftPOSIT algorithm, originally proposed by D. DeMenthon, and adapted it to our specific needs: image points are gathered by multi-threshold segmentation, texture analysis, 2D tracking and edge detection; the reference points are updated in each loop; and the calculated pose is smoothed using the quaternion representation of the model's orientation in order to stabilize the computations. We show results of image-based determination of trajectories for airborne systems. Terminal homing is demonstrated by tracking the 3D pose of a vehicle in an image sequence taken in oblique view and gathered by an infrared sensor mounted to a helicopter.

Keywords: Unmanned aerial vehicles, missile guidance, pose determination, object tracking, IR image sequence

1. INTRODUCTION

Precision guidance of unmanned aerial vehicles like missiles and drones is still one of the most important subjects in defense research. Operating in standoff mode, their mission profile is typically divided into three phases: launch, cruise and terminal homing. Current system designs use Inertial Measurement Units (IMU) and/or (D)GPS for midcourse guidance. However, because inertial data are affected by noise and drift and GPS data may be disturbed by countermeasures, accurate measurement of the missile pose with respect to the ground scenario may be difficult. To overcome these problems, navigation updates may be accomplished with automatic landmark or target recognition by analyzing high-resolution image data (IR, visible, SAR). Image-based pose determination by 3D target recognition is also suitable to improve the selection of missile impact aimpoints. In this context, the specific conditions of terminal flight trajectories have to be considered in the automatic interpretation of image data: the trajectories are usually characterized by a forward motion along the optical axis of the sensor system. Consequently, the disparities of corresponding image points detected in successive frames are small, producing numerical instabilities if classical photogrammetric methods for stereo triangulation are applied. To solve this problem, we propose a model-based approach for single-image interpretation that is independent of (critical) multiview configurations.

In section 2 we present the concepts of the method, which combines image processing algorithms and automatic sensor pose determination. The method is adapted and optimized for the constraints given by military airborne systems used for target engagement. Evaluation of system efficiency is demonstrated by tracking the 3D pose during the approach of a wheeled military vehicle (section 3). The data were taken in oblique view and gathered by an infrared sensor mounted to a helicopter. Finally, conclusions and subjects of future work are discussed in section 4.

∗[email protected]; phone +49 7243 992-321; fax +49 7243 992-299; www.fom.fgan.de

Automatic Target Recognition XIV, edited by Firooz A. Sadjadi, Proceedings of SPIE Vol. 5426 (SPIE, Bellingham, WA, 2004) · 0277-786X/04/$15 · doi: 10.1117/12.542182


2. MODEL-BASED 3D OBJECT RECOGNITION IN A SINGLE 2D IMAGE

Numerous methods for object recognition by matching three-dimensional geometric models to image features can be found in the literature [1]. The problem comprises several tasks:

• Image features: detection of suitable image features of the projected object.
• Modeling: definition of suitable 3D object features.
• Correspondence problem: matching of image features and model features.
• Pose problem: finding the rotation and translation of the object with respect to the sensor coordinate system.

The motivation of this work was to develop a geometric matching algorithm suitable for the constraints given by military airborne systems. In particular, the method should be applicable to image sequences taken in oblique view. Additionally, the interpretation of outdoor scenarios has to be considered, containing a variety of natural and man-made objects. Consequently, the method should also be robust to noisy image data of low resolution and to inaccurate 3D model descriptions.

Examples of general approaches to the pose and correspondence problems that are neither object-specific nor domain-specific are key-feature algorithms, the generalized Hough transform, tree search and geometric hashing. In addition, heuristic combinatorial optimization techniques have been proposed to solve the correspondence problem [2]. These methods differ in their search strategy and in the model projection, in which e.g. the full perspective transformation (pin-hole model) may be replaced by an approximation (weak perspective or scaled orthographic imaging). For our applications we use the SoftPOSIT algorithm to solve the model-to-image registration problem [5]. The algorithm combines the iterative Softassign algorithm for computing correspondences of point features [4] and the iterative POSIT algorithm for computing object pose under a weak-perspective camera model [3]. The pose and correspondence problems are solved simultaneously, and essentially no additional information is needed to constrain the pose of the object or the correspondence of model features to image features. Nevertheless, we adapted the algorithm to the target approach by considering geometric constraints on target pose and flight dynamics, which optimizes and stabilizes the solution process.

2.1. Sensor data

Image-based navigation for terminal homing has been developed and evaluated by interpreting image data taken with an IR sensor operating at a frame rate of 25 Hz. For data collection the sensor was mounted in the nose of a helicopter under a variable depression angle. Intrinsic sensor parameters like focal length, pixel size and principal point are known. Fig. 1 shows, as an example, three single frames out of the sequence taken while approaching the vehicle. The whole sequence consists of more than 1000 video images (depending on the platform speed) taken from about 1 km down to 50 m target distance.

Fig. 1. Typical frames out of the infrared image sequence.
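For readers unfamiliar with the camera model used later by POSIT (section 2.4), the following minimal sketch contrasts the full pin-hole projection with the weak-perspective (scaled orthographic) approximation. The intrinsic values are illustrative placeholders, not the actual parameters of the IR sensor used in the experiments.

```python
import numpy as np

# Hypothetical intrinsic parameters (focal length in pixels, principal
# point); illustrative placeholders only.
F_PIX = 800.0
CX, CY = 320.0, 240.0

def project_perspective(pts_cam):
    """Full pin-hole projection of 3D points given in camera coordinates
    (one point per row, z along the optical axis)."""
    u = F_PIX * pts_cam[:, 0] / pts_cam[:, 2] + CX
    v = F_PIX * pts_cam[:, 1] / pts_cam[:, 2] + CY
    return np.column_stack((u, v))

def project_weak_perspective(pts_cam):
    """Weak-perspective (scaled orthographic) projection as assumed by
    POSIT: every point is divided by one common reference depth (here
    the centroid depth) instead of its own depth."""
    z_ref = pts_cam[:, 2].mean()
    u = F_PIX * pts_cam[:, 0] / z_ref + CX
    v = F_PIX * pts_cam[:, 1] / z_ref + CY
    return np.column_stack((u, v))
```

The approximation error grows with the depth extent of the object relative to its distance, which is why the weak-perspective assumption degrades at close range (cf. section 3).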


For the further processing steps, the image data are preprocessed by methods for noise reduction and contrast enhancement. Low-level data processing is done by analyzing the local neighborhood of each pixel, applying well-known algorithms (Gaussian low-pass filter, median filter and Sobel filter). A correction of geometric distortions can be achieved by bicubic resampling [6].

2.2. Extraction of point features by image sequence analysis

To obtain proper image features for the matching process, we combine several image processing methods: region segmentation, frame-to-frame object tracking and edge detection.

Segmentation of an image can be interpreted as a separation of objects from the background. This is one of the most important steps in the data processing system, since the segmented objects can be analyzed individually in subsequent operations. After segmentation, each image is divided into homogeneous regions belonging either to the background or to objects. Thus we can analyze the shape of objects to produce feature vectors which can be used for classification. Existing concepts of region segmentation can be separated into three categories: pixel-based methods only use the gray values of individual pixels, region-based algorithms analyze values in larger areas, and edge-based methods try to follow edges in the image [6]. In cases where prior knowledge about the object's shape is available, model-based segmentation can be applied (e.g. Hough transform). The selection of an adequate segmentation procedure is based on the type of input data.

We use the multi-threshold segmentation procedure ISOL ("Image Segmentation by Optimization of threshold Levels" [7]) for the image sequence analysis. This method was developed and implemented at FGAN-FOM and has been adapted to our task. Multi-threshold gray-level slicing procedures are adequate tools for the segmentation of area-shaped objects. In principle, segmentation procedures of this type consist of two main processing phases. In the first phase, the gray value images are sliced at several gray value thresholds, resulting in a large number of binary images. In the second phase, a selection process determines the result of the segmentation on the basis of these binary images. Some of the binary images contain (represented as binary areas) parts of objects, whole objects, parts of the background, and finally the whole image for the lowest threshold.

Fig. 2. Examples of binarization of Fig. 1c at different levels (60, 100, 140 and 230).

Generally, an object is represented not only once but several times, as a series of regions of similar shape and size. Moreover, most generated binary regions do not represent meaningful objects at all. Therefore it is necessary to apply a selection process to obtain those regions that are optimal representations of an object or a part of an object. Ideally, each selected region should contain all pixels belonging to a certain object and no pixels belonging to the background or to other objects. This problem is solved by optimization of area size and contrast. The area size of a region corresponds to the number of pixels belonging to this region. The contrast of a region is computed as the geometric mean of the individual contrast values of all its border pixels. Within each series of corresponding regions, the region with the highest contrast is selected as representative of an object. Besides area size and contrast, several other features are derived from the segmented objects, for example mean, minimum and maximum gray value, boundary length, perimeter, center of gravity, invariant moments and statistical features.
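The two-phase principle can be sketched in a few lines of Python. This is a simplified illustration, not the actual FGAN-FOM implementation of ISOL; in particular, the contrast measure (mean border gradient instead of the geometric mean of individual border-pixel contrasts) and the greedy overlap resolution are assumptions made for brevity.

```python
import numpy as np
from scipy import ndimage

def multi_threshold_segment(img, levels, min_area=20):
    """Illustrative sketch of multi-threshold gray-level slicing with
    contrast-based region selection (simplified ISOL-like scheme)."""
    img = img.astype(float)
    gy, gx = np.gradient(img)
    grad = np.hypot(gx, gy)
    candidates = []
    # Phase 1: slice at each threshold and collect connected regions.
    for t in levels:
        labels, n = ndimage.label(img > t)
        for i in range(1, n + 1):
            region = labels == i
            if region.sum() < min_area:
                continue
            border = region & ~ndimage.binary_erosion(region)
            # Contrast approximated by the mean gradient magnitude on
            # the border pixels (an assumption; see lead-in).
            candidates.append((grad[border].mean(), region))
    # Phase 2: within each series of overlapping candidate regions,
    # greedily keep the one with the highest contrast.
    candidates.sort(key=lambda c: c[0], reverse=True)
    selected, occupied = [], np.zeros(img.shape, bool)
    for contrast, region in candidates:
        if not (region & occupied).any():
            selected.append(region)
            occupied |= region
    return selected
```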


Fig. 3. (Sub)objects detected by the segmentation procedure ISOL.

The multi-threshold segmentation procedure ISOL is applied separately to each single image of the sequence. The segmented objects from different images are then fused to produce three-dimensional (space-time) objects and a symbolic description. This feature vector can be used as input to a classification system. Tracking of objects from frame to frame is an efficient technique for the elimination of false alarms caused by noise and short-term clutter. Corresponding objects detected in different frames are connected by means of the tracking procedure; if detections from at least two frames are connected, the result is called a track. A search operation decides whether objects are to be connected or not: the implemented tracking approach checks whether the objects in the last n frames have a meaningful connection to an object in the present frame. The criteria for the selection of partner objects are the previously extracted features. If a partner object is found, it is associated with the existing track; otherwise a new track is generated. Based on track analysis, further features can be obtained to describe the resulting 3D objects (e.g. track length or percentage of gaps).
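The partner search can be sketched as a simple nearest-neighbour association over the centers of gravity. The real system selects partners using the full feature vectors, so the following snippet is only a structural illustration with hypothetical field names.

```python
import numpy as np

def associate(tracks, detections, max_dist=10.0, max_gap=3):
    """Minimal track association sketch: each track is a dict holding
    the last center of gravity, the frame index of the last hit and the
    detection history. Field names are illustrative."""
    for frame, cx, cy in detections:
        best, best_d = None, max_dist
        for tr in tracks:
            if frame - tr["last_frame"] > max_gap:
                continue  # too many missed frames: do not connect
            d = np.hypot(cx - tr["cx"], cy - tr["cy"])
            if d < best_d:
                best, best_d = tr, d
        if best is not None:  # meaningful partner found: extend track
            best.update(cx=cx, cy=cy, last_frame=frame)
            best["hits"].append((frame, cx, cy))
        else:                 # no partner: open a new track
            tracks.append({"cx": cx, "cy": cy, "last_frame": frame,
                           "hits": [(frame, cx, cy)]})
    return tracks
```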

Fig. 4. Results of tracking shown in a three-dimensional space-time cube. Left, results for single images; right, centers of gravity.

The next step of image sequence analysis is the detection of edges in each image. Almost every method for edge detection is based on differentiation, which is realized by discrete differences in the case of discrete images. An edge corresponds to an extremum of the first-order changes; thus, edges can be found by searching for maxima of the magnitude of the gradient vector. In our approach, the components of the gradient are estimated by discrete convolution with the corresponding Sobel filter kernels [6]. This yields both magnitude and direction (angle) of the gradient vector. A simple threshold operation is then used to find large values of the gradient magnitude. This operation produces a binary image B in which pixels with the value one are potential edge pixels. This image is scanned line by line. If a pixel with the value one is found, it is checked whether a higher value of the gradient magnitude exists in the direction of the gradient within the 8-neighborhood. If this is the case, the corresponding pixel is classified as a non-edge pixel and is set to zero. This process reduces the image B to lines with a width of a single pixel and is therefore called line thinning. After completion of the thinning procedure, the image B is scanned line by line again. Now, if a pixel with the value one is encountered, the maximum of the gradient magnitude is traced along the edge of the object (orthogonal to the gradient direction) until a second, smaller threshold is reached. This is often called edge tracking. To obtain valuable features for subsequent filter operations, edge length and edge strength can also be determined during the tracing. Since differentiation typically increases the noise level, combining the gradient computation with Gaussian smoothing can yield better results; this idea was also used by Canny for optimal edge detection [8].
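The gradient computation and line thinning steps might be sketched as follows (illustrative Python; the second-threshold edge tracking described above is omitted for brevity).

```python
import numpy as np
from scipy import ndimage

def thin_edges(img, t_high):
    """Sobel gradient, threshold, then suppression of pixels that have
    a stronger neighbour along the gradient direction (line thinning /
    non-maximum suppression), as described in the text."""
    img = img.astype(float)
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    mag = np.hypot(gx, gy)
    B = mag > t_high                 # binary image of edge candidates
    out = np.zeros_like(B)
    H, W = img.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            if not B[y, x]:
                continue
            # Quantize the gradient direction to the nearest 8-neighbour.
            ang = np.arctan2(gy[y, x], gx[y, x])
            dx, dy = int(round(np.cos(ang))), int(round(np.sin(ang)))
            # Keep the pixel only if no stronger gradient magnitude
            # exists along the gradient direction; otherwise it is
            # classified as a non-edge pixel.
            if (mag[y, x] >= mag[y + dy, x + dx] and
                    mag[y, x] >= mag[y - dy, x - dx]):
                out[y, x] = True
    return out
```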

Fig. 5. (a) Magnitude of the gradient vector, (b) gray value representation of the gradient direction, (c) binary image B before line thinning, (d) result after line thinning and edge tracking.

The previously described methods for region segmentation, object tracking and edge detection can be combined for the purpose of point feature extraction. These point features are required for model-based object recognition and pose estimation by SoftPOSIT. This algorithm solves two different problems simultaneously [5]: the pose problem, i.e. finding the rotation and translation of the object with respect to the sensor coordinate system, and the correspondence problem. The pose can be found by linear or nonlinear approximate methods if at least six matches between image features and model features are known; the second problem is therefore to find these correspondences. In our case, both the image features and the model features are simply points. An obvious choice of image points is the set of detected edge pixels. To suppress clutter and image noise, however, we only use those edge pixels for further processing that belong to a segmented and tracked region of the image sequence. As an example, Fig. 6 shows how the relevant image points are determined for a single image.
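In code, this masking step might look as follows (a minimal sketch; the sub-sampling strategy and the point budget are assumptions, not taken from the paper).

```python
import numpy as np

def extract_image_points(edge_mask, tracked_regions, max_points=200):
    """Keep only edge pixels that fall inside a segmented and tracked
    region, as described above; thin out the result if too many points
    remain (SoftPOSIT run time grows with the number of points)."""
    region_mask = np.zeros_like(edge_mask)
    for region in tracked_regions:       # boolean masks from segmentation
        region_mask |= region
    ys, xs = np.nonzero(edge_mask & region_mask)
    pts = np.column_stack((xs, ys)).astype(float)
    if len(pts) > max_points:            # simple uniform sub-sampling
        pts = pts[:: len(pts) // max_points]
    return pts
```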

Fig. 6. (a) Original image, (b) result of region segmentation, (c) result of edge detection, (d) extracted image points.

2.3. Generation of a 3D target model

The estimation of the sensor pose using SoftPOSIT is done by the automatic registration of 2D image points with 3D model points describing the target structure. For our example, the 3D model was generated by manual construction of a suitable wireframe. The corresponding edges are interpreted as a set of 3D model points with a spacing optimized to the sensor resolution. Geometric features of the target are extracted from a construction plan comprising parameters like the distances of corners (cab) and wheels. To reduce the number of possible correspondences for a predefined aspect, points belonging to hidden surfaces of the model are removed (backface culling).
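The two operations described here, sampling wireframe edges into points and backface culling, can be sketched as follows. This is an illustrative reading of the text: the surface normals are assumed to be supplied by the manual wireframe construction.

```python
import numpy as np

def sample_edge(p0, p1, spacing):
    """Convert one wireframe edge (endpoints p0, p1 as 3-vectors) into
    3D model points whose spacing is matched to the sensor resolution."""
    n = max(int(np.linalg.norm(p1 - p0) / spacing), 1)
    t = np.linspace(0.0, 1.0, n + 1)[:, None]
    return (1.0 - t) * p0 + t * p1

def cull_backfaces(points, normals, view_dir):
    """Backface culling for a point model: keep a point only if the
    surface normal attached to it faces the sensor (normals assumed to
    come from the wireframe construction)."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    visible = normals @ view_dir < 0.0   # normal points towards sensor
    return points[visible]
```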


Fig. 7. Several views of the generated 3D model. First row, all model points; second row, results of backface culling.

2.4. Pose estimation by SoftPOSIT

The SoftPOSIT algorithm combines two techniques to approximately solve the correspondence and pose problems. The correspondences between image points and model points are estimated by an iterative technique called Softassign, which has its roots in improvements of the original Hopfield neural network framework [4]; a schematic sketch of this step is given after the list below. The object's pose is estimated by an iterative technique called POSIT, described in [3]. These two techniques are combined into a single iteration loop, which means that the correspondences and the pose are determined simultaneously by minimizing a global objective function. A thorough presentation of the theoretical background is given in the original article [5].

Intermediate results of the SoftPOSIT algorithm during its iterative application to a single image of our exemplary sequence are shown in Fig. 8. The lower left corner shows the image points as they were extracted by the previously described methods. An initial guess of the object's pose is used in the first iteration step; this is depicted above, superimposed on the original image. Better estimates of the translation vector and the rotation matrix (i.e. the object's pose) are computed while the algorithm progresses. The lower row of Fig. 8 shows maxima of the assignment matrix by connecting the corresponding image points to model points. A comparison of the computed pose with the original image data can be seen in the top row. In this example the algorithm converged after 75 iterations, the point where no further improvements in the pose estimate were made. The quality of the model-to-image registration can be evaluated by the number of matches and the value of the global objective function. In addition, all the other descriptive features derived during region segmentation and edge detection can be used for the purpose of object recognition. However, depending on the initial guesses for the translation vector and rotation matrix, the SoftPOSIT algorithm may compute pose results that are worse than the one depicted in Fig. 8, for the following reasons:

• Detection of proper image points is a serious problem, especially in IR images. The IR signature of the object can vary significantly with temperature and meteorological conditions, so some geometrical edges may not be visible; a suitable IR model that predicts this is not yet available.

• The algorithm often converges to a local minimum of the objective function. There is no guarantee of finding the global optimum starting from an arbitrary initial guess. This depends, among other factors, on the number of occluded model points and the amount of clutter in the image.
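As announced above, the following schematic sketch illustrates the Softassign correspondence step as we read it from [4] and [5]. The initialization of the slack row and column is simplified here compared to the original formulation.

```python
import numpy as np

def softassign(d2, beta, alpha, n_sinkhorn=30):
    """Schematic Softassign step: d2[j, k] is the squared distance
    between image point j and the projection of model point k under the
    current pose estimate; alpha sets the distance beyond which leaving
    a point unmatched is preferable; beta is the deterministic-annealing
    parameter, increased between the outer SoftPOSIT iterations."""
    J, K = d2.shape
    M = np.ones((J + 1, K + 1))          # last row/column: slack entries
    M[:J, :K] = np.exp(-beta * (d2 - alpha))
    # Sinkhorn iterations: alternately normalize the non-slack rows and
    # columns so that M approaches a doubly stochastic assignment matrix.
    for _ in range(n_sinkhorn):
        M[:J, :] /= M[:J, :].sum(axis=1, keepdims=True)
        M[:, :K] /= M[:, :K].sum(axis=0, keepdims=True)
    return M
```

In the full algorithm, the resulting assignment weights feed the POSIT pose update, and beta is raised so that the soft correspondences gradually harden into one-to-one matches.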


Fig. 8. Application of the SoftPOSIT algorithm to a single image. Intermediate results after 1, 20 and 75 iterations.

The way of searching for the global optimum proposed by D. DeMenthon is to run the algorithm from many different initial guesses and to keep the first solution that meets a specified termination criterion. Since this method is too expensive for our application, we take a different approach. If the algorithm converges to an incorrect pose (i.e. a local minimum), the computed pose is physically unrealistic in almost every case; for our example, the vehicle will always stand on its wheels and never be upside down. Moreover, in most practical cases secondary information about the sensor's position is available from GPS or IMU data. This information restricts the set of possible poses and can be taken into account. To restrict the rotation matrices to those describing a realistic orientation of the model, the three Euler angles of the rotation are computed in every iteration step. By analyzing these Euler angles it is easy to decide whether the corresponding pose is feasible or not. If it is not, the irregular pose is adjusted while the algorithm is in progress.
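A feasibility test of this kind might be sketched as follows. The Euler convention and the angle limits are illustrative assumptions, not the values used in our system.

```python
import numpy as np

def pose_is_feasible(R, roll_max_deg=20.0, pitch_max_deg=30.0):
    """Plausibility test for rejecting local minima: decompose the
    rotation matrix into Euler angles and check them against limits
    encoding 'the vehicle stands on its wheels' plus coarse IMU/GPS
    knowledge. Limits and convention are illustrative assumptions."""
    # ZYX (yaw-pitch-roll) decomposition; one of several conventions.
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return (abs(np.degrees(roll)) <= roll_max_deg and
            abs(np.degrees(pitch)) <= pitch_max_deg)
```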

3. POSE DETERMINATION FOR TERMINAL HOMING

The method for pose determination described in section 2 is based on single-image interpretation and may be used independently for each frame of the image sequence: starting with random values for the initial pose, iterations are carried out until the stop criterion for matching is met. For terminal homing, however, the knowledge that the sensor pose changes only slightly between successive frames can be exploited to optimize the computation of the flight trajectory. Therefore, the following modifications of the original approach are implemented in the system. To reduce processing time and enhance numerical stability, the sensor pose computed for the current image is used as the initial value for the next frame. In addition, the number of iterations of the SoftPOSIT algorithm is limited by the user, and the analysis of the next frame is started even if the stop criterion has not been met. Our investigations have shown that a single iteration loop is adequate if a proper averaging of the estimated sensor pose is carried out at the same time. For this purpose we applied a moving-average algorithm by transforming the 3x3 rotation matrices into the space of 4D quaternions [9].

The complete set of sensor pose data computed for each frame of the sequence represents the sensor trajectory. Fig. 9 shows, as an example, the results for three frames taken while approaching the vehicle. At present, the suitability of the proposed method is estimated by qualitatively assessing the alignment of the target signature with the 3D model features (lines), which are projected into the image plane using the current pose values (Fig. 9a). For visualization, the resulting wireframe is transformed into a surface model and rendered in a synthetic scenario (Fig. 9b). The sensor pose may also be visualized by a simulated static observer imaging the complete scenario; the actual sensor pose and the sensor field of view are represented by pyramids (Fig. 9c).
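The quaternion-space moving average can be sketched as follows. This simple renormalized mean with hemisphere alignment is a stand-in for the angular-velocity-constrained interpolation of [9], under the assumption of small frame-to-frame rotations.

```python
import numpy as np

def quat_moving_average(quats, window=5):
    """Smooth the orientation estimate by averaging the last `window`
    unit quaternions and renormalizing. Since q and -q encode the same
    rotation, each quaternion is first flipped into the hemisphere of
    the most recent one."""
    q_ref = quats[-1]
    acc = np.zeros(4)
    for q in quats[-window:]:
        if np.dot(q, q_ref) < 0.0:   # align hemisphere before averaging
            q = -q
        acc += q
    return acc / np.linalg.norm(acc)
```

The smoothed quaternion is converted back to a 3x3 rotation matrix before it is used as the initial pose for the next frame.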

Fig. 9. (a) Alignment of IR signature and target model, (b) synthetic scenario, (c) sensor pose and sensor field of view.

For the evaluation of the algorithm, successive position values are interpolated to generate a smooth trajectory. Comparing the estimated trajectory with the assumed straight target approach yields the following qualitative assessment: for intermediate target distances, the pose estimation error decreases as the target is approached. At greater distances, low target resolution causes inaccurate extraction of image structures, resulting in significant errors and oscillations of the computed trajectory. At closer range, the simplified weak-perspective camera model assumed by SoftPOSIT may become inadequate, producing an increase in pose error (Fig. 10). In this case, the sensor pose may be extrapolated by prediction from the previous image and Kalman filtering.

4. CONCLUSIONS

The proposed method for image-based estimation of navigation data combines image processing algorithms for the detection of relevant target features with a method for solving the model-to-image registration problem. Using the calculated pose parameters, the alignment of the projected model structures with the image signature serves as a visual assessment, indicating the successful estimation of the sensor trajectory by the model-based analysis.


Future work will aim to improve robustness and to optimize processing time by incorporating measured pose data supplied, for example, by inertial measurement units (IMU) or GPS. Another subject of ongoing investigation is the treatment of scaling effects in the field of view: especially for terminal homing, the change in target resolution during the approach has to be accommodated by adaptive models with increasing levels of detail.

Fig. 10. (a) Pose error in relation to the distance between sensor and object, (b) absolute error of the computed pose.

REFERENCES

1. W. E. L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints, The MIT Press, Cambridge, Massachusetts, 1990.
2. J. R. Beveridge, M. Riseman, "Optimal geometric model matching under full 3D perspective", Computer Vision and Image Understanding, 61(3), pp. 351-364, 1995.
3. D. DeMenthon, L. S. Davis, "Model-based object pose in 25 lines of code", International Journal of Computer Vision, vol. 15, pp. 123-141, 1995.
4. S. Gold, A. Rangarajan, C. P. Lu, S. Pappu, E. Mjolsness, "New algorithms for 2D and 3D point matching: pose estimation and correspondence", Pattern Recognition, vol. 31, pp. 1019-1031, 1998.
5. P. David, D. F. DeMenthon, R. Duraiswami, H. Samet, "SoftPOSIT: simultaneous pose and correspondence determination", European Conference on Computer Vision, Copenhagen, Denmark, May 2002, pp. 698-714.
6. B. Jähne, Digital Image Processing, 5th revised and extended edition, Springer, Heidelberg, Germany, 2002.
7. C. Anderer, U. Thönnessen, M. F. Carlsohn, A. Klonz, "Ein Bildsegmentierer für die echtzeitnahe Verarbeitung" (An image segmenter for near-real-time processing), Mustererkennung 1989, 11. DAGM-Symposium, Hamburg, Oct. 1989, Proceedings, H. Burkhardt, K. H. Höhne, B. Neumann (Eds.), Informatik-Fachberichte 219, pp. 380-384, Springer, Berlin/Heidelberg, 1989.
8. J. F. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), pp. 679-698, 1986.
9. A. H. Barr, B. L. Currin, S. Gabriel, J. F. Hughes, "Smooth interpolation of orientations with angular velocity constraints using quaternions", Computer Graphics, 26(2), pp. 313-320, 1992.
