

Pattern Recognition 35 (2002) 1463–1479 www.elsevier.com/locate/patcog

Retrieval by classification of images containing large manmade objects using perceptual grouping

Qasim Iqbal, J.K. Aggarwal ∗

Department of Electrical and Computer Engineering, Computer and Vision Research Center, The University of Texas at Austin, Austin, TX 78712, USA

Received 17 October 2000; accepted 13 July 2001

Abstract

This paper applies perceptual grouping rules to the retrieval by classification of images containing large manmade objects such as buildings, towers, bridges, and other architectural objects. The semantic interrelationships between primitive image features are exploited by perceptual grouping to extract structure to detect the presence of manmade objects. Segmentation and detailed object representation are not required. The system analyzes each image to extract features that are strong evidence of the presence of these objects. These features are generated by the strong boundaries typical of manmade structures: straight line segments, longer linear lines, coterminations, “L” junctions, “U” junctions, parallel lines, parallel groups, “significant” parallel groups, cotermination graph, and polygons. A K-nearest neighbor framework is employed to classify these features and retrieve the images that contain manmade objects. Results are demonstrated for two databases of monocular outdoor images. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Perceptual grouping; Structure; Content-based image retrieval; Image databases; Multi-media systems; Nearest neighbor classifier

1. Introduction

Computer vision imposes unique requirements on the representation and manipulation of image data and knowledge [1]. Lower-level approaches analyze an image on a strictly quantitative basis for color, intensity, contrast and texture features. Higher-level techniques extract semantic information that may be used by a higher-level reasoning process [2]. The objective of this

This work was supported in part by the Army Research Office under contracts DAAD19-00-1-0044, DAAG55-98-1-0230 and DAAD19-99-1-0012 (Johns Hopkins University subcontract agreement 8905-48168).

∗ Corresponding author. Tel.: +1-512-471-1369; fax: +1-512-471-5532. E-mail addresses: [email protected] (Q. Iqbal), [email protected] (J.K. Aggarwal).

research is to develop and apply computer vision methods of higher-level image analysis to image retrieval by classification. In this paper, we develop a framework that exploits the image structure that represents large manmade objects (e.g. buildings, towers, bridges and other architectural objects) for the retrieval of images from a database. We apply perceptual grouping principles to extract this structure for image retrieval. The human visual system can detect many classes of

patterns and statistically significant arrangements of image elements. Perceptual grouping refers to the human visual ability to extract significant image relations from low-level primitive image features without prior knowledge of the image content [2]. Research in perceptual grouping was started in the 1920s by the Gestalt psychologists, whose goal was to discover the underlying principle that would unify the various grouping phenomena of human perception. The word Gestalt means “whole” or




“configuration”. Gestalt psychologists observed the tendency of the human visual system to perceive configurational wholes, with rules that govern the uniformity of psychological grouping for perception and recognition, as opposed to recognition by analysis of discrete primitive image features. The grouping principles proposed by Gestalt psychologists embodied such concepts as grouping by proximity, similarity, continuation, closure, and symmetry [3]. Perceptual grouping of low-level image descriptions

provides a higher-level structure. These higher-level structures can be further combined to yield another level of higher-level structures. The process may be repeated until a meaningful representation is achieved that can be used by a higher-level reasoning process [4]. It is believed that the human visual system performs a similar hierarchical structuring of lower-level image features into higher-level representations. The final structure obtained after grouping all lower-level

features to a higher-level structure will represent the shape of an object in an image. A precise model of the object may still be required for recognition. Interestingly, a decision regarding the presence of manmade objects may be made without the need to localize or recognize a specific object. Recognition of a particular object requires more knowledge about object properties. In our approach, segmentation and detailed object representation are not required. We illustrate the efficacy of our idea by applying

the general rules of perceptual grouping to two image databases of monocular outdoor images to retrieve images containing manmade objects. To detect the presence of manmade objects from primitive image features using the principles of perceptual grouping, the following features are extracted hierarchically from an image: straight line segments, longer linear lines, coterminations, “L” junctions, “U” junctions, parallel lines, parallel groups, “significant” parallel groups, cotermination graph and polygons. The presence of these distinguishing features in an image follows the “principle of non-accidentalness” [3,5] and, therefore, is more likely to be generated by manmade objects. Hence, these features can be considered to be the discriminating criteria between images that contain significant manmade objects and those that do not. A three-dimensional feature vector is extracted from

these features. The feature space is assumed to be partitioned into three classes: images containing significant structure exhibited by manmade objects, images containing none, and images containing intermediate-level structure. We analyze two databases containing color images of monocular outdoor scenes, using a K-nearest neighbor scheme to classify the images. The rest of the paper is organized as follows: Section

1.1 presents the current image retrieval paradigms, Section 1.2 describes previous work in the application of

perceptual grouping, Section 2 explains the extraction of structure via perceptual grouping, Section 3 summarizes the statistics generated and the feature vector extracted, Section 4 outlines the results obtained, and finally, Section 5 presents the conclusions.

1.1. Paradigms in image retrieval: model-based and view-based

Recognizing objects in a scene is a primary task of computer vision. The ability to extract and describe distinct 3D objects in a complex scene is crucial for image understanding. Broadly speaking, we can classify objects in a natural scene into two types: manmade and natural objects. Computer perception of manmade objects in outdoor scenes is a challenging task due to the presence of both manmade and natural objects. Natural objects, such as trees and other vegetation, rivers, rocks and clouds, coexist with manmade objects and are unspecified; their appearance is unpredictable. Very few objects have compact shape descriptions, and it is difficult to establish complete boundaries between the objects of interest and the background objects [6]. The complexity of this task depends on many factors, such as the number and complexity of objects, the number of objects in the model database, and the amount of a priori information about the scene. Current techniques in image retrieval fall into two

broad categories: model-based and view-based. While view-based techniques are based on recognition of similar visual attributes such as color and texture, model-based techniques retrieve images based on shape, which requires segmentation of the image into objects. The segmented objects are described by means of a 3D or 2D CAD model that outlines the interrelations between primitive geometric image features such as lines, vertices and ellipses. Extracted image features are used to define the model properties and construct a 3D CAD model of a manmade object of interest [7]. Model-based approaches are generally used for manmade object recognition. Manmade objects are generally rigid, therefore these models can adequately define their shape descriptions. However, the model-based approach is limited by the lack of 3D or 2D representations of natural objects. Model-based techniques extract semantic information. They require a priori knowledge about the shape of objects (object models), which is used to predict image features for matching to features in the image or in a transformed feature space. Segmentation of an image to obtain different regions,

followed by the analysis of their structural information to recognize the desired model, is a top-down approach. Automatic segmentation and recognition of objects via object models is a difficult task. The quality of segmentation directly affects the performance of an image retrieval system, which tends to classify image similarity on the



basis of similar objects. The complexity arises out of the difficulty in establishing complete boundaries between objects of interest in natural scenes, and the fact that very few natural objects have compact shape descriptions [6]. Consequently, image retrieval using model-based recognition is used less frequently than view-based approaches. View-based approaches tend to characterize what is

really observed in an image instead of making some underlying assumptions about the CAD representations of the objects. These techniques inherently exploit the color/intensity or luminance information in the image, as the shape descriptions of the objects by 3D models are not available. The basic idea is to use color/intensity information, which does not depend on the geometric features of the objects. Current view-based retrieval techniques analyze image data at a lower level on a strictly quantitative basis for color, intensity, contrast and texture features [8–13]. As stated above, model-based techniques for shape

determination and recognition of 3D objects [14,15] and segmentation [16] are used less frequently for image retrieval. Until a complete solution to the segmentation problem is found, image retrieval systems must assume some sort of automatic or semiautomatic segmentation. Input in the form of user-defined, manually segmented regions [17], automatically segmented objects and editing [18] aids the process of image retrieval. Spatial interrelations between objects present in an image, represented as a spatial-orientation graph [19] or a 2D string structure [18,20] summarizing the spatial content of an image in an iconic form, have also been used. User-drawn sketches are also used for image retrieval. These sketches are used as elastically deformable templates to define object shape similarity for shape matching [21]. Our approach is model-based; we search for regular

features (models) mentioned in the introduction using perceptual grouping to identify the presence of manmade objects. However, our approach is bottom-up, as opposed to top-down, since we generate higher-level structures from primitive image features via a hierarchical grouping process. In our approach, segmentation and detailed object representation are not required. Many research groups are actively pursuing

content-based indexing, storage and retrieval of images. Some systems have already been built which provide content-based image retrieval [17,22–24]. These systems require user interaction, which emphasizes shape, color, and texture features to build queries. For a survey on the content-based access of images, refer to Ref. [25].

1.2. Previous work in perceptual grouping

To discover and describe structure, the human visual system uses a wide array of perceptual grouping mechanisms. These range from the relatively low-level

mechanisms that underlie the simplest principles of grouping and segregation, to relatively high-level mechanisms in which complex learned associations guide the discovery of structure. Perceptual grouping generally results in highly compact representations of images, facilitating later processing, storage, and retrieval [26]. The importance of perceptual grouping for

recognition cannot be overemphasized. In the absence of information for perceptual grouping, it is difficult for humans to make an intelligent decision regarding the structure or recognition of an object. Experiments conducted with line drawings in which most of the elements of significant collinearity, end point proximity, parallelism and symmetry were removed demonstrated the difficulty perceived by human subjects in recognizing the objects [3]. With the addition of an element at a key location, the human subjects were able to perceive the line drawings with remarkable ease. Presumably, if the added element had been placed at some location that did not lend itself to perceptual grouping, the change in recognition times would have been negligible. The ability to influence recognition times by controlling the formation of perceptual grouping illustrates the search-based nature of this process, and it has been hypothesized that perceptual grouping can be a key element in search space and recognition time reduction. Many computer vision systems implicitly use some

aspects of processing that can be directly related to the perceptual grouping processes of the human visual system [27]. Frequently, however, no claim is made about the pertinence or adequacy of the digital models, as embodied by computer algorithms, to the proper model of human visual perception [28]. Edge-linking and region-segmentation, which are used as structuring processes for object recognition, are seldom considered to be a part of an overall attempt to structure an image [27]. This enigmatic situation arises because research and development in computer vision is often considered quite separate from research into the functioning of human vision [3]. A fact that is generally ignored is that biological vision is currently the only measure of the incompleteness of the current stage of computer vision, and illustrates that the problem is still open to solution. Perceptual grouping principles have been incorporated

into computer vision research beginning with the detection of manmade objects in the early 1980s. Approaches based on perceptual grouping principles are computationally expensive, but advances in computing technology have made them possible. Their functional role has been addressed by a number of researchers. The most important functions of perceptual grouping are considered to be segmentation, 3D inference, and indexing of world knowledge [3]. Use of these functions can lead to a significant reduction of the search space for object recognition. An interesting formulation of perceptual grouping principles is presented in Ref. [27], where they have



been formulated in an energy minimization framework for object characterization. Various techniques have been proposed for the

application of perceptual grouping principles to group lower-level primitive image features, such as edge points, into meaningful higher-level representations [29–31]. Solving practical computer vision problems relating to the detection of manmade regions has accrued greater significance [4,32–35]. Building detection using perceptual grouping is an emerging area of the application of the Gestalt laws of psychology to computer vision. Generic models are used to locate buildings in the images. Detection of buildings in aerial images using the principles of stereo vision has been investigated by a number of researchers [36]. These researchers have used shadows, walls, and shape from shading information extracted from the stereo vision pairs for the detection of buildings. Learning systems to automatically detect rooftops are also being developed [37]. Most of the above-mentioned work is confined to locating manmade objects like buildings in aerial images.

2. Structural analysis: employing perceptual grouping for retrieval

This section focuses on the development of a methodology for feature extraction using higher-level image analysis employing structure extracted via perceptual grouping. The following discussion highlights the extraction of salient information from an image that will help in making a decision regarding the presence of manmade objects in an image depicting an outdoor scene. The original idea of the Gestalt psychologists was to find a simple way of describing visual perception. However, “simplicity” is not a very well-defined term. In the 1950s, an agreement was reached on a general principle of simplicity, also called “the minimum principle” [38], which stated that if other constraints are equal, then a perceptual response to a stimulus will be obtained that requires the least amount of information to specify. This idea has motivated a lot of work. An attempt was

made to describe the visual content of an image by only “high curvature line fragments” [39]. It was shown that after retaining only high curvature line fragments, an image was still recognizable by humans. This instigated a strong belief that maximum curvature is perceptually significant to human vision. However, other researchers have argued that maximum curvature was not perceptually significant, and that it is rather the perpendicular proximity of the image curves to the projections of the object curves that is significant for human visual perception [3]. A lot of research has been done in the automatic

extraction and formulation of significant features from images. Current image retrieval methods cannot determine

the type of image data to be extracted, and some form of human involvement becomes indispensable in the selection of discriminating feature points. At the lowest level of computer vision, potentially

useful image events such as edges and line segments can be extracted from an image without any knowledge of the image content. For unconstrained environments, where the viewing angle and depth are not known, the bottom-up approach of hierarchically grouping meaningful edge segments into higher-level structures appears to be promising for the extraction of semantic information. Certain scene structures will always produce images

with discernable features regardless of viewpoint, while other scene structures virtually never do. This correlation between salience and invariance has suggested that the perceptual salience of viewpoint invariance is due to the leverage it provides for inferring geometric properties of objects and scenes. It has been noted that many of the perceptually salient image properties identified by the Gestalt psychologists, such as collinearity, parallelism, and good continuation, are viewpoint invariant [40].

2.1. Principle of non-accidentalness

As mentioned above, it is difficult to establish a precise definition of simplicity of interpretations that guide the perceptual grouping of lower-level primitive features. One alternative is that the preferred configuration of features is one that corresponds to the properties of meaningful physical structures present in an image, e.g., surfaces and objects. Consequently, perceptual grouping may operate by exploiting regularities in input that are non-accidental in that they arise due to regularities that are likely to occur in the physical world [5]. A non-accidental property may be defined as a configuration of features for which the probability of occurring due to chance is essentially very small; or conversely, as a configuration for which the number of times it arises due to a coherent structure in the physical world is large compared to the times it arises due to chance. Thus, it can be consistently inferred that a particular feature configuration corresponds to a particular structure. The principle of non-accidentalness may be used to account for the grouping phenomena discussed earlier. For example, the principle of proximity may be explained by the fact that if two features are adjacent in an image, they are likely to be adjacent in the 3D world.

2.2. Extraction of salient information from the images

Manmade objects generally have sharp edges and straight boundaries. The prominent characteristics of a manmade object are the apparent regularity and the relationship of its component features. An image containing a large manmade object will exhibit a large number of significant edges, junctions, parallel lines and groups,

Page 5: Retrieval by classication of images containing large ...cvrc.ece.utexas.edu/Publications/Q. Iqbal Retrieval...manmade objects using perceptual grouping QasimIqbal,J.K.Aggarwal

Q. Iqbal, J.K. Aggarwal / Pattern Recognition 35 (2002) 1463–1479 1467

and closed "gures comprised of polygons, comparedto an image with predominantly non-manmade objects(such as scenes of vegetation). These structures aregenerated by the presence of corners in the object, suchas windows, doors and the boundaries of a building.These higher-level features exhibit apparent regularityand relationship, and are strong evidence of structure inan image.Straight lines extracted from images containing no

manmade object are generally randomly distributed. The presence of the distinguishing features mentioned above follows the “principle of non-accidentalness” and, therefore, is more likely to be generated by manmade objects. Hence, these features can be considered to be the discriminating criteria between an image containing predominantly manmade objects and an image containing natural objects.

2.3. Data representation

To detect the presence of manmade objects in an imageusing the principles of perceptual grouping, the followingfeatures are extracted from an image:

• straight line segments,
• longer linear lines,
• coterminations,
• “L” junctions,
• “U” junctions,
• parallel lines,
• parallel groups,
• “significant” parallel groups,
• cotermination graph, and
• polygons.

Manmade objects are generally rigid (in a still image, articulated objects also have rigid representations); therefore, these representations adequately define their shape descriptions [41]. The extraction process is hierarchical. The first stage is the extraction of straight line edges, which tend to be small fragments that are grouped to form longer linear lines, which then are used to find coterminations. A set of lines terminating at a common point is called a cotermination. The cotermination is an important relation. According to the proximity rule of perceptual grouping, the human visual system easily groups coterminous lines. In fact, it has been suggested that the major function of eye movements is to determine coterminous edges [42]. Cotermination is a non-accidental relationship and, hence, reflects significant structural information [4]. Coterminations are grouped to extract “L” junctions, and “L” junctions are grouped to get “U” junctions. The linear lines are also used to extract parallel lines. A

large number of manmade objects contain parallel structures. Human vision can rapidly identify parallel lines [3]. A parallel relation is a non-accidental relationship

that can be used to infer relationships in three-space. These parallel lines are grouped together to find parallel groups, which are then used to extract significant parallel groups. Coterminations are also used to extract a cotermination graph. Polygons are extracted using the cotermination graph. A polygon is also a significant image relation. According to the closure rule of perceptual grouping, human vision tends to complete curves to form enclosed regions. Extracting closed figures corresponds to this feature of human vision. Polygons are non-accidental image relationships, since the coterminations forming them are non-accidental. Hence, polygons represent significant structures in an image. Polygons may also be used for 3D inference; for example, a closed figure in an image usually suggests a closed structure in 3D space. Polygons are higher-level structures than lines and coterminations. The rationale for the perceptual grouping relations

described above is the following [4]:

• Spatially close primitive structures are likely to be related and to reflect meaningful structures. Spatially close primitive structures are more likely to be perceptually grouped according to the proximity grouping rule.

• Some primitive structures may be caused by accidental image relations of natural objects. For example, line segments extracted from a cluster of tree leaves may accidentally form a parallel group or a polygon. Such parallel groups or polygons tend to be randomly and sparsely distributed and unlikely to form meaningful structures.

• Manmade objects usually consist of regular structures related structurally and spatially.

The subsequent sections describe in detail the extraction of each of these data items.

2.4. Extraction of straight line segments

Burns' straight line detector [43] has been used to extract straight line segments. The distinction between the terms “line segments” and “lines” will be made on the basis that a line is obtained by grouping these fragmented line segments. Burns' operator uses the orientation of the local gradient in the neighborhood of a pixel to obtain “line-support” regions of similar gradient orientation. The structure of the associated intensity surface is examined to extract the location and properties of the edge.

2.5. Extraction of strong-edged lines

Burns' operator generates a lot of “spurious” line segments, which are low contrast segments that arise from slight variations in the local intensity. The line segments forming the linear structure of a manmade object should be strong-edged lines.

Page 6: Retrieval by classication of images containing large ...cvrc.ece.utexas.edu/Publications/Q. Iqbal Retrieval...manmade objects using perceptual grouping QasimIqbal,J.K.Aggarwal

1468 Q. Iqbal, J.K. Aggarwal / Pattern Recognition 35 (2002) 1463–1479

If line segments with edge strength of less than a threshold are retained, then a spurious structure may be generated by a higher-level grouping process. Thus, it is necessary to get rid of them at this stage. All segments are examined for edge strength. To eliminate low contrast line segments, the edge strength associated with a segment is examined and only those segments which meet the following criterion are retained:

E(L_i) > δ_e,    (1)

where E(·) represents the ratio of the edge strength associated with a segment L_i to the maximum edge strength in the image, and δ_e is a threshold. Edge strength is calculated as the average gradient magnitude in the support region bounded by the segment. For practical purposes, the extracted straight line

segments may not reflect well the linear structures present in the image. Further processing is required to get “good” linear lines from these segments, as described in the next section.
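To make the filter of Eq. (1) concrete, the following is a minimal Python sketch. The segment representation (end-point pairs with a precomputed average gradient magnitude per support region) and the function name are our own assumptions, not something the paper prescribes.

```python
import numpy as np

def filter_strong_edges(segments, strengths, delta_e=0.1):
    """Retain only segments satisfying Eq. (1): edge strength, normalized
    by the maximum edge strength in the image, greater than delta_e.

    segments  -- list of ((x1, y1), (x2, y2)) end-point pairs
    strengths -- average gradient magnitude over each support region
    """
    strengths = np.asarray(strengths, dtype=float)
    normalized = strengths / strengths.max()   # E(L_i)
    return [s for s, e in zip(segments, normalized) if e > delta_e]
```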

2.6. Extraction of longer linear lines

In practice, the straight line segments obtained from Burns' operator are fragmented and must be grouped together to obtain longer linear lines at a higher granularity level. To accomplish this task, a representative line for a set of closely bunched and similarly oriented straight line segments obtained from Burns' operator should be found. This is due to the fact that the longer straight line edges represent the linear structure of an object at a higher granularity level than the line segments themselves [32].

tion techniques [29,30] are not suitable for extracting lin-ear structure because they only link approximate collinearlines by examining the neighborhoods of the end pointsof each line segment. This process is called line segmentextension. A line folding process is described in Ref.[32], where the space around each straight line edge (linesegment) is folded onto the segment repeatedly to ob-tain a single line representing the grouped straight line.This process, however, does not perform line extension.The line folding and extension process described belowis employed [2].Proximate line segments with similar orientations are

perceived as continuous lines, according to Gestalt lawsof proximity, similarity, and continuation, and are per-ceived to have originated from the same linear structurein the scene. Hence, they should be merged into one line.The symmetric orthogonally elongated region of width

δ_f = 2δ_n with a line segment L_b as the medial axis (as shown in Fig. 1) is searched to collect a set of segments, S_fe (which also includes the original segment, denoted the base segment). This set is replaced by a representative line L_r if the following conditions are satisfied:

Fig. 1. Linear line obtained (the figure shows segments L_1, L_2 and the base segment L_b merged into the representative line L_r within a neighborhood of half-width δ_n).

(a)

A(L_b, L_i) < δ_a,    (2)

max{D_o(L_b, e_i1), D_o(L_b, e_i2)} < δ_n,    (3)

(b) either

D(e_i, e_j) < δ_n,    (4)

or

Λ(φ_Lj(L_i), L_j) > 0,    (5)

where L_b denotes the base segment, L_i and L_j denote any two segments in S_fe, A(·,·) denotes the absolute value of the smaller angle in radians between the two segments, and δ_a is a threshold. In the above equations, e_i and e_j are the end-points of the segments L_i and L_j, respectively (e_i1 and e_i2 are the two end-points of the segment L_i); D_o(·,·) is the orthogonal distance of an end-point to a segment, and D(·,·) represents the distance between the end-points of L_i and L_j that are nearer to each other. In the last equation, φ_Lj(L_i) is the orthogonal projection of L_i onto L_j, and Λ(·,·) outputs the length of the overlap between any two lines.

approximately collinear to Lb. Eqs. (4) and (5) requirethat any segment in Sfe must be either close to (end-pointsfall in a circular neighborhood of radius �n), or overlapat least one other segment in Sfe, respectively, to ensurecontinuity.To "x the representative line Lr , as shown in Fig. 1,

we need one point through which it passes (we calculate the mid-point of L_r), its orientation, and its length. The mid-point and orientation of L_r are obtained as the weighted averages of the mid-points and the orientations of all line segments in the set S_fe, respectively. The weights are determined using the lengths of the segments. To obtain the length of L_r, the end-points of all segments in S_fe are orthogonally projected onto L_r and the two farthest points are taken as the end-points of L_r [44].
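Under the assumption that S_fe has already been collected and validated against Eqs. (2)–(5), the construction of L_r can be sketched as follows; the circular averaging of doubled orientations is an implementation detail we add to keep the weighted mean well defined for undirected lines.

```python
import numpy as np

def representative_line(segments):
    """Fuse a set S_fe of nearly collinear segments into one line L_r.

    segments -- array of shape (n, 2, 2): n segments, two (x, y) end-points each.
    """
    segs = np.asarray(segments, dtype=float)
    d = segs[:, 1] - segs[:, 0]
    lengths = np.linalg.norm(d, axis=1)
    w = lengths / lengths.sum()                      # length-based weights

    mid = (w[:, None] * segs.mean(axis=1)).sum(axis=0)   # weighted mid-point

    # Length-weighted mean orientation; angles are doubled so that
    # directions theta and theta + pi average correctly.
    theta = np.arctan2(d[:, 1], d[:, 0])
    mean2 = np.arctan2((w * np.sin(2 * theta)).sum(),
                       (w * np.cos(2 * theta)).sum())
    u = np.array([np.cos(mean2 / 2), np.sin(mean2 / 2)])  # unit direction of L_r

    # Project every end-point onto the line through `mid` with direction u;
    # the two extreme projections become the end-points of L_r.
    t = (segs.reshape(-1, 2) - mid) @ u
    return mid + t.min() * u, mid + t.max() * u
```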



Fig. 2. Extraction of coterminations.

The process is continued until no merging occurs. Termination of the process is guaranteed after obtaining a finite number of longer linear lines, as there are a finite number of line segments and their number decreases after each iteration.

2.6.1. Extraction of longer lines
After the extraction of linear lines by the line folding

and extension process described above, some small lines may still exist in the image. These lines do not have any other lines in their neighborhoods and, hence, could not be grouped to obtain larger lines. These lines are not very significant in describing the linear structure of a manmade object and may be eliminated. Their elimination will also reduce the computational cost. All lines, including the longer linear lines, are analyzed, and those lines meeting the following criterion are retained:

L(L_i) > δ_l,    (6)

where L(·) outputs the length of a line and δ_l is a threshold. These lines are referred to as the “retained lines”.

2.7. Extraction of coterminations

A set of lines terminating at a common point is called a cotermination (Fig. 2). In practice, a small neighborhood is constructed around a point to allow the lines to terminate in a small common region. Each pair of lines {L_i, L_j} in a cotermination satisfies the following conditions:

δ_c ≤ θ ≤ π − δ_c,    (7)

max{|d_y(e_i, e_j)|, |d_x(e_i, e_j)|} ≤ δ_n,    (8)

where θ is the angle between L_i and L_j, δ_c is the similarity angle, and δ_n is a threshold. The cotermination is the intersection of the two lines obtained by extending the end-points e_i and e_j of lines L_i and L_j, respectively, which fall in a common region. In the above equation, d_y(e_i, e_j) and d_x(e_i, e_j) are the differences in the y and x coordinates

Fig. 3. Extraction of “L” junctions.

of e_i and e_j, respectively. Eq. (7) ensures that lines which are approximately collinear are eliminated. Eq. (8) requires the two lines to fall within a small neighborhood to constitute a valid cotermination.
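A minimal sketch of this pairwise test, with lines assumed to be stored as end-point pairs and thresholds defaulted to the Table 2 values:

```python
import numpy as np

def is_cotermination(Li, Lj, delta_c=np.deg2rad(30), delta_n=5.0):
    """Pairwise test of Eqs. (7)-(8). Lines are ((x1, y1), (x2, y2)) pairs."""
    Li, Lj = np.asarray(Li, float), np.asarray(Lj, float)
    di, dj = Li[1] - Li[0], Lj[1] - Lj[0]
    # Eq. (7): reject near-collinear pairs. Taking |cos| folds the angle
    # into [0, pi/2], so delta_c <= theta <= pi - delta_c reduces to
    # theta >= delta_c.
    cosang = abs(di @ dj) / (np.linalg.norm(di) * np.linalg.norm(dj))
    if np.arccos(np.clip(cosang, 0.0, 1.0)) < delta_c:
        return False
    # Eq. (8): some pair of end-points must fall in a common
    # delta_n neighborhood (checkerboard distance).
    return any(np.max(np.abs(ei - ej)) <= delta_n for ei in Li for ej in Lj)
```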

2.8. Extraction of “L” junctions

“L” junctions are formed by coterminations that have an internal angle close to π/2 rad. These junctions are strong evidence of salient corners in an image and, hence, indicators of a manmade structure (e.g., one that may be attributed to a building). Each pair of lines {L_i, L_j}, where i ≠ j, in an “L” junction satisfies the following conditions:

D(e_i, e_j) < δ_n,    (9)

π/2 − A(L_i, L_j) < δ_la,    (10)

where δ_la is a threshold. Eq. (10) requires that the angle formed by L_i and L_j be within a threshold δ_la of π/2 (Fig. 3). In an oblique view, the angle enclosed by an “L” junction may be far from π/2. Relaxation of the value of δ_la can accommodate an oblique viewing angle. It may be noted that two “L” junctions may share a

common line. It is important to consider this when lines in “L” junctions are counted, so that common lines are not counted more than once.
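The corresponding predicate for Eqs. (9) and (10) follows the same pattern; again the line representation and the Table 2 defaults are assumptions of the sketch.

```python
import numpy as np

def is_L_junction(Li, Lj, delta_n=5.0, delta_la=np.deg2rad(30)):
    """Pairwise test of Eqs. (9)-(10): nearest end-points within delta_n
    and enclosed angle within delta_la of pi/2."""
    Li, Lj = np.asarray(Li, float), np.asarray(Lj, float)
    # Eq. (9): distance between the nearest pair of end-points.
    if min(np.linalg.norm(ei - ej) for ei in Li for ej in Lj) >= delta_n:
        return False
    di, dj = Li[1] - Li[0], Lj[1] - Lj[0]
    cosang = abs(di @ dj) / (np.linalg.norm(di) * np.linalg.norm(dj))
    angle = np.arccos(np.clip(cosang, 0.0, 1.0))   # A(L_i, L_j) in [0, pi/2]
    return np.pi / 2 - angle < delta_la            # Eq. (10)
```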

2.9. Extraction of “U” junctions

A “U” junction is formed by the alignment of two “L” junctions. “U” junctions are generated by regular manmade structures such as windows and doors in a building, in particular, and generally by regular structures in other manmade objects. Their presence in an image is a strong indication of the presence of a manmade object. Two types of “U” junction are possible: (1) “U” junctions that have a common line between the “L” junctions, and (2) those that do not have a common line. For the



Invalid "U" junction

Valid "U" junction

L12

L22L21

L11 Lur

(a) (b)

Fig. 4. Extraction of “U” junctions.

"rst case, the direction in which the “non-common” lines“point” should be approximately similar for the two “L”junctions to constitute a valid “U” junction, as shown inFig. 4(a).For the second case, two “L” junctions are grouped

together to form a “U” junction if they satisfy the following conditions:

(a)

A(L_11, L_ur) < δ_ua,    (11)

A(L_21, L_ur) < δ_ua,    (12)

(b) either

D(e_11, e_21) < 2δ_n,    (13)

or

Λ(φ_L21(L_11), L_21) > 0,    (14)

(c) L_12 and L_22 point in approximately similar directions,

where L_11 and L_21 are the lines in the two “L” junctions that are perceived to be “pointing” towards each other, as shown in Fig. 4(b), L_12 and L_22 are the two “other” lines in the “L” junctions, L_ur is the representative line obtained by joining the mid-points of the internal end-points of the lines forming the “L” junctions, and δ_ua is a threshold.

Eqs. (11) and (12) imply that the angles between L_ur and L_11, and between L_ur and L_21, should be less than δ_ua to ensure a valid “U” junction. The threshold in Eq. (13) may be set to any value close to but greater than δ_n (in Section 2.6). If this threshold is set less than or equal to δ_n, then no groupings from disjoint “L” junctions, with their end-points falling in a small neighborhood, may be possible, as L_11 and L_21 may already have been grouped together to form a larger collinear line. We have chosen the threshold value of 2δ_n for convenience. If this condition is not satisfied, then Eq. (14) examines

the lines L_11 and L_21 to see if there is overlap between them to constitute valid lines for a possible “U” junction. If more than one “L” junction fulfills all of the above criteria for a particular target “L” junction, then the “L” junction for which the length of L_ur is the smallest is matched with the target “L” junction. It should be noted that a “U” junction resulting from

an “L” junction and a “single” line is not possible. This is due to the fact that if the single line is close to one of the lines of an “L” junction, then that single line already forms a valid “L” junction with the line in the “L” junction close to it. Hence, only combinations of “L” junctions yield the desired “U” junctions.

2.10. Extraction of parallel lines

All “retained” lines are searched to extract sets of parallel lines. A set is collected by picking a base line L_b in the image and finding all lines that satisfy:

A(L_b, L_i) < δ_a,    (15)

where L_i is a line other than L_b and δ_a is a threshold. The parallel lines obtained are grouped into clusters of similar orientations to avoid a brute-force search in the detection of parallel groups.
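A sketch of this step; the greedy single-pass clustering is our own choice, since the paper does not specify how the orientation clusters are formed.

```python
import numpy as np

def orientation_clusters(lines, delta_a=np.deg2rad(5)):
    """Cluster retained lines by orientation (Eq. (15)): a line joins a
    cluster when its angle to the cluster's base line is below delta_a.
    Lines are ((x1, y1), (x2, y2)) end-point pairs."""
    def angle(L):
        (x1, y1), (x2, y2) = L
        return np.arctan2(y2 - y1, x2 - x1) % np.pi   # undirected orientation

    clusters = []                                     # [(base_angle, [lines])]
    for L in lines:
        a = angle(L)
        for base, members in clusters:
            diff = abs(a - base)
            if min(diff, np.pi - diff) < delta_a:     # smaller angle A(L_b, L_i)
                members.append(L)
                break
        else:
            clusters.append((a, [L]))
    return [members for _, members in clusters]
```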

2.11. Extraction of parallel groups

Parallel groups are obtained by grouping the parallel lines that significantly overlap each other. The overlapping is determined by orthogonal projection [32,33]. However, certain groupings of parallel lines that have an intrinsic orientation different from their local orientation may not be grouped together. Consider the image shown in Fig. 5. In Fig. 5(a), a

linear line segment L_1 is orthogonally projected onto another linear line segment L_2. The end points of the projected line are given as A and B, as shown in the figure. There is a significant overlapping of the projection of line L_1 onto line L_2. Therefore L_1 and L_2 are said to constitute a valid parallel group. In Fig. 5(b) and (c),



Fig. 5. Extraction of parallel groups: (a) lines L_1, L_2 with significant orthogonal overlap; (b), (c) sets of lines whose intrinsic orientation differs from their local orientation.

there is an angular difference between the intrinsic and local orientations of the set of lines, resulting in insignificant overlap in the orthogonal direction. However, there is enough overlap in the vertical and horizontal projections of the lines in Fig. 5(b) and (c), respectively, to constitute a valid parallel group. We employ a general procedure for extracting parallel

groups that also accounts for the difference in the intrinsic and local orientations of lines [4]. A parallel group is a set of lines S_pg = {L_1, L_2, …, L_M}, where M ≥ 2, which satisfies the conditions that for each line L_i ∈ S_pg, there exists a line L_j ∈ S_pg, where i ≠ j, such that:

(a) L_i and L_j have “similar” lengths, i.e.,

L(L_i)/L(L_j) > δ_pg1,    (16)

where δ_pg1 is a threshold. (L_i is assumed to be shorter in length than L_j, here and subsequently.) The threshold δ_pg1 may be relaxed to accommodate a possible oblique view.

(b) L_i and L_j are “relatively” close, i.e.,

D(e_mid_i, e_mid_j)/L_avg(L_i, L_j) < δ_pg2,    (17)

where e_mid_i and e_mid_j are the mid-points of L_i and L_j, respectively, L_avg(·,·) denotes the average length of the two lines, and δ_pg2 is a threshold.

(c) L_i and L_j have “sufficient overlap” in one of the three projections, i.e.,

Λ(φ_Yaxis(L_i), φ_Yaxis(L_j))/L(φ_Yaxis(L_i)) > δ_pg3,    (18)

Λ(φ_Xaxis(L_i), φ_Xaxis(L_j))/L(φ_Xaxis(L_i)) > δ_pg3,    (19)

Λ(φ_Lj(L_i), L_j)/L(φ_Lj(L_i)) > δ_pg3,    (20)

where Yaxis and Xaxis represent the Y-axis and the X-axis of an image, respectively, and δ_pg3 is a threshold.

Perceptual grouping rules of proximity, parallelism, and similarity have been utilized to formulate the above rules, which ensure that lines with similar length, orientation, closure, and sufficient overlap are grouped together. The lines that do not follow these rules are discarded and not grouped together because they are not likely to originate from the same parallel group.
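Conditions (16)–(20) translate almost directly into code. The sketch below tests one pair of lines, with thresholds defaulted to the Table 2 values; the interval-overlap helper and the handling of degenerate (zero-length) projections are our own additions.

```python
import numpy as np

def _overlap(a, b):
    """Length of overlap between two 1-D intervals (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def parallel_pair(Li, Lj, d_pg1=0.5, d_pg2=2.5, d_pg3=0.5):
    """Pairwise test of Eqs. (16)-(20); lines are ((x1, y1), (x2, y2)) pairs."""
    Li, Lj = np.asarray(Li, float), np.asarray(Lj, float)
    li, lj = np.linalg.norm(Li[1] - Li[0]), np.linalg.norm(Lj[1] - Lj[0])
    if li > lj:                                  # L_i is taken as the shorter line
        Li, Lj, li, lj = Lj, Li, lj, li
    if li / lj <= d_pg1:                         # Eq. (16): similar lengths
        return False
    mid_i, mid_j = Li.mean(axis=0), Lj.mean(axis=0)
    if np.linalg.norm(mid_i - mid_j) / ((li + lj) / 2) >= d_pg2:
        return False                             # Eq. (17): relatively close
    # Eqs. (18)-(19): overlap of the y-axis and x-axis projections.
    for axis in (1, 0):
        pi_ = sorted(Li[:, axis])
        pj = sorted(Lj[:, axis])
        if pi_[1] > pi_[0] and _overlap(pi_, pj) / (pi_[1] - pi_[0]) > d_pg3:
            return True
    # Eq. (20): overlap of the orthogonal projection of L_i onto L_j.
    u = (Lj[1] - Lj[0]) / lj                     # unit direction of L_j
    ti = sorted((Li - Lj[0]) @ u)
    tj = sorted((Lj - Lj[0]) @ u)
    return ti[1] > ti[0] and _overlap(ti, tj) / (ti[1] - ti[0]) > d_pg3
```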

2.11.1. Extraction of “significant” parallel groups
Some of the parallel groups obtained in the last section

may not be considered to be “significant”, since they may be generated by structures other than the sharp corner edges generated by, say, windows or doors in a building. Therefore, the following criteria are adopted to identify significant parallel groups:

(a) at least one line in the parallel group is enclosed by an “L” or a “U” junction;
(b) either

A(L_i, Yaxis) < δ_spga,    (21)

or

A(L_i, Xaxis) < δ_spga,    (22)

where L_i is any line in a parallel group and δ_spga is a threshold. The threshold δ_la (in Section 2.8) accommodates an oblique viewing angle in (a). Eqs. (21) and (22) provide a control on the obliqueness of the viewing angle. The threshold δ_spga may be set equal to π/4 radians to allow parallel groups in any orientation (viewing angle).

2.12. Extraction of polygons

Polygons are closed "gures formed by non-parallellines. Finding closed "gures may be done by tracing thelines of coterminations. Starting from an end-point of aline and going along the given lines or cotermination, ifwe can come back to the starting point, a closed "gureis found. Practically, as in extracting coterminations, wecannot expect a point connection between lines and totrace back to a point. Instead, we should consider smallregions. However, the time-complexity of this kind ofdirect search is exponential. Alternatively, a polynomialtime algorithm based on elements of graph theory [45] isemployed to extract polygons from an image using thecotermination graph. The underlying idea is to take ad-vantage of the one-to-one correspondence between theclosed "gures comprised of lines and the circuits in thegraph. The approach "nds a set of independent and suf-"ciently closed "gures using the cotermination graph. Aset of fundamental circuits is then searched and extracted.The closed "gures (or circuits) are then checked with theconditions described below to determine if they consti-tute a valid polygon.



Fig. 6. (a) Closed figures comprising polygons. (b) Closed figures not comprising polygons.

Let G = (V, E) be a cotermination graph, where V and E are the set of vertices and the set of edges of G, respectively. Let e_ij ∈ E be an edge connecting vertices v_i, v_j ∈ V. The weight of e_ij is defined as w(e_ij) = deg(v_i) + deg(v_j), where deg(·) is the degree of a vertex, that is, the number of edges incident with the vertex. The edge weights are collected by extracting the adjacency matrix of the graph. If two (different) vertices are connected to each other, the adjacency matrix has the corresponding entry set to 1; otherwise it is set to 0. The adjacency matrix is symmetric. The connected components of the graph are found, and the sub-graph corresponding to each component is processed separately. The weight of a spanning tree is the sum of the weights of all the branches in the spanning tree. We search for the maximal spanning tree, which may be found by slightly altering the minimal spanning tree algorithm to obtain the maximal-weight spanning tree [45]. The maximal spanning tree is employed to extract the fundamental circuits. Each fundamental circuit represents a closed figure in the image, where edges on this circuit correspond to line segments on the closed figure.
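A sketch of this circuit extraction using the networkx library; building the cotermination graph itself is assumed to have been done elsewhere, and representing each fundamental circuit as a closed list of vertices is our convention.

```python
import networkx as nx

def fundamental_circuits(G):
    """Extract fundamental circuits from a cotermination graph G
    (an undirected networkx Graph)."""
    # Weight each edge by the sum of its end-point degrees,
    # w(e_ij) = deg(v_i) + deg(v_j).
    for u, v in G.edges():
        G[u][v]["weight"] = G.degree(u) + G.degree(v)

    circuits = []
    # Each connected component is processed separately.
    for comp in nx.connected_components(G):
        sub = G.subgraph(comp)
        tree = nx.maximum_spanning_tree(sub, weight="weight")
        tree_edges = {frozenset(e) for e in tree.edges()}
        # Every non-tree edge (chord) closes exactly one fundamental
        # circuit: the chord plus the unique tree path between its ends.
        for u, v in sub.edges():
            if frozenset((u, v)) not in tree_edges:
                circuits.append(nx.shortest_path(tree, u, v) + [u])
    return circuits
```

Because the tree path between the end-points of a chord is unique, each chord yields exactly one fundamental circuit, which is what keeps the search polynomial rather than exponential.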

extracted that meets the following requirements [4]:

(a) the polygon is simple, i.e., the edges of the polygon do not intersect among themselves,

(b) the polygon is relatively compact,

Q(P) ≤ δ_Q Q(C(P)),    (23)

where C(·) is the convex hull of P, δ_Q is a threshold, and Q(·) is defined as:

Q(P) = perimeter²(P)/area(P),    (24)

(c) the polygon does not have many cavities,

n_i ≤ δ_cv n_cv,    (25)

where n_i is the number of vertices of P inside C(P), n_cv is the number of vertices of P on C(P), and δ_cv ≤ 1 is a constant,

(d) the number of edges on the polygon does not exceed a given threshold, δ_ne.

Fig. 6(a) displays some closed figures that comprise polygons, whereas Fig. 6(b) shows some closed figures that do not constitute polygons.
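Conditions (b)–(d) can be checked with standard computational-geometry primitives. The sketch below uses scipy's convex hull (in two dimensions, ConvexHull.area is the hull perimeter and ConvexHull.volume its enclosed area) and assumes the simplicity test (a) was performed when the circuit was traced; counting a vertex that lies exactly on a hull edge as interior is a simplification of the sketch.

```python
import numpy as np
from scipy.spatial import ConvexHull

def _perimeter_area(pts):
    """Perimeter and (shoelace) area of a polygon given as (n, 2) vertices."""
    closed = np.vstack([pts, pts[:1]])
    perimeter = np.linalg.norm(np.diff(closed, axis=0), axis=1).sum()
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(x @ np.roll(y, -1) - y @ np.roll(x, -1))
    return perimeter, area

def is_valid_polygon(points, d_Q=2.25, d_cv=0.33, d_ne=20):
    """Conditions (b)-(d), Eqs. (23)-(25); thresholds from Table 2."""
    pts = np.asarray(points, float)
    if len(pts) > d_ne:                          # (d): edge-count bound
        return False
    per, area = _perimeter_area(pts)
    if area == 0:
        return False
    hull = ConvexHull(pts)
    if per ** 2 / area > d_Q * hull.area ** 2 / hull.volume:
        return False                             # (b): compactness, Eq. (23)
    n_cv = len(hull.vertices)                    # vertices of P on C(P)
    n_i = len(pts) - n_cv                        # vertices of P inside C(P)
    return n_i <= d_cv * n_cv                    # (c): cavity bound, Eq. (25)
```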

3. Feature extraction and classification

This section explains the process of feature extraction from an image and the corresponding classification of the image.

3.1. Statistical analysis of data items extracted from a sample image

Table 1 lists the significant data items extracted from the image shown in Fig. 7. These statistics were collected by associating a data structure of the following type with each of the lines:

Structure: begin
    T_1
    T_2
    ...
    T_N
end.

Here T_j denotes the number of times a particular line L_i is a member of a certain image event, and N is the number of image events. For example, an event may be that the line is one of the lines forming an “L” junction. Membership in a particular image event is determined if the following predicate is true:

P_Tj(L_i) = TRUE,    (26)

where P_Tj(·) is the predicate that determines whether the line is a member of the event T_j. The number of times the predicate operating on L_i is true is counted



Table 1
Statistics of salient data items extracted from the image shown in Fig. 7. An asterisk (*) indicates a key statistic used in the computation of the feature vector

Item                                                    #
All lines                                               5526
Lines with edge strength < δ_e                          3650
Strong-edged lines                                      1876
Lines extended                                          930
Lines obtained by extension                             411
Lines after extension                                   1357
Lines with length < δ_l                                 955
Retained lines *                                        402
Coterminations                                          108
“L” junctions                                           107
“U” junctions                                           27
Lines in coterminations                                 173
Lines in “L” junctions *                                172
Lines in “U” junctions *                                84
Parallel lines                                          401
Parallel types (clusters)                               14
Parallel groups and polygons                            73
Significant parallel groups and polygons                57
Lines in parallel groups and polygons                   401
Lines in significant parallel groups and polygons *     225

Fig. 7. An image of a building.

and the value is denoted by T_j. Common lines are only counted once.

3.2. Feature extraction

In general form, the feature vector extracted is expressed as:

X = (x_1, x_2, x_3)^t,    (27)

where

x_1 = (# of lines in “L” junctions)/(total # of “retained” lines),    (28)

x_2 = (# of lines in “U” junctions)/(total # of “retained” lines),    (29)

x_3 = (# of lines in (significant) parallel groups and polygons)/(total # of “retained” lines).    (30)

Table 2
Values of thresholds used to generate the results

δ_e     δ_a     δ_n     δ_c     δ_l     δ_la    δ_ua
0.1     5°      5       30°     10      30°     30°

δ_pg1   δ_pg2   δ_pg3   δ_spga  δ_Q     δ_cv    δ_ne
0.5     2.5     0.5     30°     2.25    0.33    20

The numerators in Eqs. (28)–(30) are normalized by the number of “retained” lines to ensure a fair comparison between images. Common lines in the respective calculations of x_1, x_2 and x_3 are only counted once.
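Computing X from the counts is then immediate; the example values below are the key statistics (*) of Table 1 for the building image of Fig. 7.

```python
import numpy as np

def feature_vector(n_L, n_U, n_pg, n_retained):
    """Eqs. (28)-(30): event line counts (common lines counted only once)
    normalized by the number of retained lines."""
    return np.array([n_L, n_U, n_pg], dtype=float) / n_retained

# Key statistics (*) from Table 1 for the building image of Fig. 7:
X = feature_vector(172, 84, 225, 402)
print(X.round(3))   # approximately [0.428 0.209 0.56 ]
```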

As evident, x_i ∈ [0, 1], where i ∈ {1, 2, 3}; i.e., an image is mapped into a feature space bounded by a unit cube. The feature vector X represents the coordinates of the mapped image in this space. In our experiments X is obtained using the threshold values shown in Table 2. The angles are displayed in degrees. These threshold values are kept constant for the generation of results. We have striven to develop a system with no

constraints on the viewing angle or depth. In the absence of any knowledge of these two fundamental quantities, we believe that the empirical selection of the above parameters is justified, and that the threshold values shown in Table 2 may be treated as constants instead of variable parameters. Moreover, since these values are applied uniformly to all images in the database, all images are equally affected. These values were carefully chosen to accommodate a wide variety of viewing angles and depths.

3.3. Identification of image classes

We assume that the image space consists of three classes: structure, non-structure and intermediate, denoted as Ω_1, Ω_2 and Ω_3, respectively. The intermediate class is added to account for the fact that some outdoor images are too ambiguous to reliably classify as either structural or non-structural, even for human operators. Sample images of the three classes are shown in Fig. 8. Each of these three classes, Ω_1, Ω_2, Ω_3, has an

associated discriminant function, denoted as g_1, g_2 and g_3, respectively. Representing an image classifier in a canonical form through a set of these discriminant functions, the classifier assigns a feature vector X, and hence the image from which it is extracted, to class Ω_i if:

g_i(X) > g_j(X),  j ≠ i,  i, j ∈ {1, 2, 3},    (31)

where ties are resolved arbitrarily.



Fig. 8. Sample images belonging to the three classes. (a) Structure. (b) Non-structure. (c) Intermediate.

3.4. Selection of g_i

Images are assigned to one of the three classes using the K-nearest neighbor classifier. The K-nearest neighbor classifier assigns a feature vector X to the class Ω_i, i ∈ {1, 2, 3}, with the largest number of training feature vectors among the K nearest vectors to X in the feature space, i.e., X is assigned to the class with the highest discriminant value in Eq. (31):

ω_NN(X) = Ω_i if g_i(X) > g_j(X),  j ≠ i,  i, j ∈ {1, 2, 3},    (32)

where ω_NN(X) assigns a class label to the test feature vector X, and g_k(X), k ∈ {1, 2, 3}, is the number of training feature vectors belonging to Ω_k that are among the K nearest vectors to X in the feature space. A distance

function d(X, X_kn) measures the distance between a test feature vector X and the nth training vector X_kn belonging to the kth class, and may be selected as the l2 norm:

d(X, X_kn) = ||X − X_kn|| = √((X − X_kn)^t (X − X_kn)).    (33)
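A minimal implementation of this rule; the array-based representation of the training set is our own choice, and K defaults to the value used for the main experiments in Section 4.3.

```python
import numpy as np

def knn_classify(X, train_X, train_y, K=5):
    """Assign X to the class with the most members among its K nearest
    training vectors (Eq. (32)), using the l2 distance of Eq. (33)."""
    d = np.linalg.norm(np.asarray(train_X, float) - np.asarray(X, float), axis=1)
    nearest = np.argsort(d)[:K]                    # indices of the K nearest
    labels, votes = np.unique(np.asarray(train_y)[nearest], return_counts=True)
    return labels[np.argmax(votes)]                # ties resolved arbitrarily
```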

4. Results obtained

This section describes the results obtained by analyzing two different image databases, a combined total of 2660 images. For the analysis of both image databases, we assumed that the image space consists of three classes: structure, non-structure and intermediate. We seek to retrieve images in all classes.

4.1. Database #1:

This database consists of 2139 color images of size 1024 × 1024 acquired from two CDs obtained from Visual Delights, Inc. (http://www.visualdelights.net): “Austin and Vicinity: The Human World” and “Austin and Vicinity: The World of Nature”.

4.2. Database #2:

The second database consists of 521 color images of size 512 × 512 acquired from the ground level using a Sony Digital Mavica camera.

4.3. Experiments performed

Figs. 9–11 display the first 25 images retrieved for each class using databases #1 and #2. A total of 48 images were chosen (16 from each of the three classes) as training images. The images were sorted on the basis of the distance of the extracted feature vector from the center of the corresponding class. The remaining 2612 images were classified using a K-nearest neighbor classifier into one of the structure, non-structure or intermediate classes. The value of K was chosen to be 5. For statistical analysis, database #2 was used to

quantify retrieval results in terms of recall/precision and the efficiency of the system. This database contained 255 structure class images, 140 non-structure class images and 96 intermediate class images. A total of 30 images were chosen (10 from each of the three classes) as training images. The value of K was chosen to be 1. Tables 3–6 display these results. Table 3 shows the overall retrieval rate. Table 4

displays class-conditional retrieval performance measured in terms of recall and precision. Recall is defined as the fraction of the total number of images that are correctly retrieved for a particular class. Precision is defined as the fraction of images retrieved for a particular class that are


Fig. 9. First 25 images retrieved for the structure class (databases #1 and #2).

Table 3
Overall retrieval rate (database #2). T = total # of images, D = effective # of images (total minus training), C = # of correctly classified images, RR = retrieval rate

  Total (T)   Training   Effective (D)   Correct (C)   RR (C/D)
  521         30         491             361           73.52%

Table 4
Recall and precision (database #2). T = total, R = retrieved, C = correct

  Class           T     R     C     Recall (C/T) %   Precision (C/R) %
  Structure       255   235   198   77.65            84.26
  Non-structure   140   143   114   81.43            79.72
  Intermediate     96   113    49   51.04            43.36

The full retrieval statistics are given in the confusion matrix of Table 5.

Table 5
Confusion matrix (database #2). Entries are presented in rows; e.g., 198 structure-class images were classified as structure, 16 as non-structure, and 41 as intermediate

  True class      Structure   Non-structure   Intermediate
  Structure       198          16             41
  Non-structure     3         114             23
  Intermediate     34          13             49
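The recall and precision figures in Table 4 follow directly from the rows and columns of this confusion matrix. The short check below is our own sketch, not part of the paper; it reproduces the Table 4 values from the Table 5 entries.

```python
import numpy as np

# Table 5 confusion matrix: rows = true class, columns = assigned class,
# in the order structure, non-structure, intermediate
conf = np.array([[198,  16,  41],
                 [  3, 114,  23],
                 [ 34,  13,  49]])

T = conf.sum(axis=1)   # images per true class: [255, 140, 96]
R = conf.sum(axis=0)   # images retrieved per class: [235, 143, 113]
C = np.diag(conf)      # correctly retrieved: [198, 114, 49]

recall = 100 * C / T       # [77.65, 81.43, 51.04] %, as in Table 4
precision = 100 * C / R    # [84.26, 79.72, 43.36] %, as in Table 4
```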

Table 6 displays the results of another experiment. It shows the distribution of images that actually belong to a particular class within the "best matches" for that class, in intervals of 100 images, together with the corresponding efficiency of the system. The best matches were obtained by sorting all images in ascending order of their distances from the training samples of each class. Efficiency is defined as the ratio of the number of images that actually belong to a particular class within the block of closest best matches to the size of the block, where the block size equals the number of images belonging to that class.
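Read as code, the efficiency measure can be sketched as follows. This is our own interpretation: the paper does not specify how the distance of an image from a class's training samples is aggregated, so we assume it is precomputed into a single score per image, and the function name is hypothetical.

```python
import numpy as np

def efficiency(dist_to_class, true_labels, cls):
    """'Best matches' efficiency E_D = Q/T for one class (Table 6).

    dist_to_class: per-image distance score to the training samples of
                   class `cls` (aggregation assumed precomputed)
    true_labels:   ground-truth class label of every test image
    """
    T = int(np.sum(true_labels == cls))        # images truly in the class
    best = np.argsort(dist_to_class)[:T]       # the T closest best matches
    Q = int(np.sum(true_labels[best] == cls))  # class members among them
    return Q / T
```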


Fig. 10. First 25 images retrieved for the non-structure class (databases #1 and #2).

Table 6
"Best matches" and efficiency (database #2). Distribution of images actually belonging to a particular class in the best matches for that class, in intervals of 100 images, and the efficiency of the system. T = total # of images belonging to a class, Q = # of images actually belonging to that class in the first T best matches, E_D = efficiency

  Class           1–100   101–200   201–300   301–400   401–491   T     Q     E_D = (Q/T) %
  Structure        78      78        68        24        7        255   200   78.43
  Non-structure    86      38        15         1        —        140   109   77.86
  Intermediate     47      21         9        16        3         96    45   46.88

The proposed system was also applied to special cases of outdoor images containing natural objects whose edges might generate structures similar to those of manmade objects. For example, tall trees have sharp (piecewise) straight-line edges. Fig. 12 displays four images of trees with such edges. The system was able to classify them correctly as belonging to the non-structure class. Full statistics for these images are given in Table 7.

5. Conclusion

This paper has presented perceptual grouping as an effective tool for the retrieval by classification of images containing large manmade objects such as buildings, towers, bridges, and other architectural objects. Based on the semantic interrelationships of different primitive image features, a methodology was presented for applying perceptual grouping rules to the extraction of structure. In our approach, segmentation and detailed object representation were not required. Two databases of monocular outdoor images were employed to show the efficacy of using structure for retrieval.

The system analyzed each image to extract features that were strong evidence of the presence of manmade objects.


Fig. 11. First 25 images retrieved for the intermediate class (databases #1 and #2).

Fig. 12. Images containing natural objects that have edges similar to those of manmade objects. The images shown were successfully classified to the non-structure class.

These features are generated by the strong boundaries typical of the structures that comprise manmade objects: straight line segments, longer linear lines, coterminations, "L" junctions, "U" junctions, parallel lines, parallel groups, "significant" parallel groups, the cotermination graph, and polygons.

Table 7
Statistics of images that have edges similar to those of manmade objects. Some examples of such images are shown in Fig. 12

  Item                                                 #
  Total images                                         109
  Incorrectly classified to the structure class          2
  Correctly classified to the non-structure class       93
  Incorrectly classified to the intermediate class      14
  Recall                                               85.32%

These features were obtained by perceptual grouping of primitive image features through bottom-up processing. The image space was partitioned into three classes: structure, non-structure, and intermediate. A K-nearest neighbor framework analyzed these features and retrieved the images it perceived to contain manmade objects (the structure class). The judicious use of structure extracted by perceptual grouping, as demonstrated by the results, illustrates the efficacy of applying perceptual grouping rules to the retrieval of images containing manmade objects.



About the Author—QASIM IQBAL obtained a Bachelor of Science (B.Sc.) degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1996. He obtained a Master of Science (M.S.E.) degree in electrical engineering from The University of Texas at Austin in 1998. He is currently working towards a Ph.D. in computer vision at the Computer and Vision Research Center (CVRC), The University of Texas at Austin. His research interests include computer vision, content-based image retrieval, image processing, wavelets, and pattern recognition.

About the Author—J.K. AGGARWAL has served on the faculty of The University of Texas at Austin College of Engineering since 1964 and is currently the Cullen Professor of Electrical and Computer Engineering and Director of the Computer and Vision Research Center. His research interests include computer vision and pattern recognition. A Fellow of IEEE since 1976 and of IAPR since 1998, he received the Senior Research Award of the American Society for Engineering Education in 1992 and the 1996 Technical Achievement Award of the IEEE Computer Society. He is the author or editor of 7 books and 39 book chapters, and the author of over 175 journal papers as well as numerous proceedings papers and technical reports. He has served as Chairman of the IEEE Computer Society Technical Committee on Pattern Analysis and Machine Intelligence (1987–1989); Director of the NATO Advanced Research Workshop on Multisensor Fusion for Computer Vision, Grenoble, France (1989); Chairman of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1993); and President of the International Association for Pattern Recognition (1992–1994).