Computer Vision and Image Understanding
Vol. 75, Nos. 1/2, July/August, pp. 165–174, 1999
Article ID cviu.1999.0771, available online at http://www.idealibrary.com

Image Classification and Querying Using Composite Region Templates*

John R. Smith and Chung-Sheng Li

IBM T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, New York 10532
E-mail: {jrsmith, csli}@watson.ibm.com

* This paper was presented in part at the IEEE CVPR-98 Workshop on Content-based Access to Image and Video Libraries (CBAIVL), June 1998.

The tremendous growth in digital imagery is driving the need for more sophisticated methods for automatic image analysis, cataloging, and searching. We present a method for classifying and querying images based on the spatial orderings of regions or objects using composite region templates (CRTs). The CRTs capture the spatial information statistically and provide a robust way to measure similarity in the presence of region insertions, deletions, substitutions, replications, and relocations. The CRTs can be used for classifying and annotating images by assigning symbols to the regions or objects and by extracting symbol strings from spatial scans of the images. The symbol strings can be decoded using a library of annotated CRTs to automatically label and classify the images. The CRTs can also be used for searching by sketch or example by measuring image similarity based on relative counts of the CRTs. © 1999 Academic Press

1. INTRODUCTION

The growing proliferation of digital photographs and video is increasing the need for more sophisticated methods for automatically analyzing, cataloging, and searching for digital imagery. Example applications include searching for visual information on the World Wide Web [1–3], digital libraries [4–6], and content adaptation [7, 8]. Many recent content-based query projects have advanced the capabilities for searching for images and video by color, texture, shape, motion, and other features (see [9–13]). These systems are effective in allowing searching by the similarity of visual features but are often limited in their capability to automatically derive a higher, semantic-level understanding of the images. Several projects have shown that content-based systems can improve searching and attain a better understanding of the content by capturing spatial information in addition to low-level features [14–16]. However, effective methods are needed for efficiently describing and comparing scene information represented by the spatial composition of regions, objects, and features.

Since humans perceive images by breaking the scenes into surfaces, regions, and objects, the spatial, temporal, and feature attributes of the objects, and their relationships to each other, are important characteristics of visual information [17]. Content-based retrieval systems that use global descriptors, such as color histograms [11, 9], do not sufficiently describe the important spatial information. In large collections of photographs, many regions recur, such as those that correspond to blue skies, oceans, grassy regions, orange horizons, mountains, building facades, and so forth. For images, the detection and description of these regions and their spatial relationships is essential in truly characterizing the images for searching, classification, and filtering purposes [18, 19].

1.1. Related Work

There are many ways to capture the spatial and region information in images.
Some recent approaches include 2D strings [20, 21], θ-R representations [22], local histograms [23–25], co-occurrence matrices, color correlograms [15], and region tabulations [9, 16]. Composition descriptors such as the 2D string and its variants are brittle in the sense that minor changes in region locations can greatly affect the comparison of two images. Descriptors such as θ-R and co-occurrence matrices are not widely applicable due to sensitivity to scale. This can be extremely problematic when comparing images of different resolutions. Smith and Chang developed a method for computing integrated spatial and feature queries by decomposing the set of region queries into separate subgoals [14]. However, the method does not provide a general way to measure the similarity of two images using spatial and feature information.

Other recent methods have been developed for classifying images from low-level features such as color, texture, and shape. Szummer and Picard developed a method for inferring high-level scene properties such as indoor vs outdoor by classifying low-level features such as color and texture [26]. Huang et al. developed a hierarchical image classifier which uses banded color correlograms to assign images to semantic categories. Caelli and Reye developed a single spatio-chromatic feature space for classifying images based on color, texture, and shape [27]. Carson et al. developed a color and texture blob representation of images which is used to classify images using a decision tree [28]. Forsyth et al. developed a body-plans approach for analyzing and classifying photographs, based on the spatial configuration of regions belonging to the main object [29].


FIG. 1. Overview of the process for generating composite region templates (CRTs) from the images by (1) extracting color regions, (2) scanning the image regions to generate region strings, and (3) consolidating the region strings.


1.2. Overview

In this paper, we describe a method for classifying and querying images by spatial orderings of regions. The composite region templates (CRTs) descriptor framework provides several representations for characterizing images:

1. a region string representation (S) for describing spatial orderings of regions,

2. a CRT descriptor (T) that indicates the instances of region precedence in spatial scans, and

3. a CRT descriptor matrix (M) that enables robust searching and classification based on spatial information.

For one, the CRT descriptors allow images to be searched by comparing the counts of regions with different attributes preceding each other in scans of the images. The CRTs allow images to be graphically queried by sketch or by providing an example image. The CRTs can also be used to describe the prototypal spatial orders of regions that recur throughout a database. This allows unknown images to be annotated by matching sequences of regions to the entries in a library of prototypal CRTs.

1.3. Outline

The paper is organized as follows: in Section 2, we describe the extraction processes for generating the CRT descriptors from color photographs. In Section 3, we describe the processes for searching for images and classifying unknown images by decoding the region strings using the CRT library. In Section 4, we evaluate the performance of the system in classifying images from 10 semantic classes: beaches, buildings, crabs, divers, faces, horses, nature, silhouettes, sunsets, and tigers. We also compare the retrieval effectiveness of the CRT method to methods based on color histograms and texture in retrieving images from a database of 893 color photographs.

2. COMPOSITE REGION TEMPLATES

The CRTs are generated by counting the instances of region-region precedence in the region strings. In general, many features of the regions, such as shape, texture, edges, and motion, are important in characterizing the regions. The CRTs are general in that they hold symbols that can represent any type of meta-data, annotations, descriptors, or feature prototypes, or the index values of entries in a visual feature library. In practical applications, the visual features may be derived manually or automatically, such as by sampling the images and clustering the extracted features, as in [28].

2.1. Spatial and Feature Similarity

We propose a region string representation for describing the spatial order of regions obtained by scanning the image. Consider a series of regions, where each region is assigned a symbol value. Each symbol value represents an attribute, descriptor, or feature of the region. In order to compare the spatial information in the images, the symbol strings need to be compared. Several problems emerge, such as how to deal with:

1. insertions, deletions, substitutions, replications, and relocations of symbols,

2. comparing region strings of different lengths,
3. performing partial or substring matching, and
4. representing and comparing classes of images using region strings.

The CRT descriptors provide a way to compare the symbol sequences in a statistical fashion. The CRT descriptors are generated by mapping the region strings into a CRT descriptor matrix, where entry [i, j] in the CRT descriptor matrix indicates the count M[i, j] of the instances in which symbol i preceded symbol j in the symbol string.
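To make the mapping concrete, here is a minimal Python sketch (our illustration, not the authors' implementation) that builds an L = 2 CRT descriptor matrix from a single region string of symbol indices; the six-symbol alphabet and the example string are assumed purely for illustration.

import numpy as np

def crt_descriptor_matrix(region_string, num_symbols):
    # Entry M[i, j] counts how often symbol i precedes symbol j
    # (not necessarily adjacently) in the region string.
    M = np.zeros((num_symbols, num_symbols), dtype=int)
    for a in range(len(region_string)):
        for b in range(a + 1, len(region_string)):
            M[region_string[a], region_string[b]] += 1
    return M

# Toy alphabet of six symbols and the string s2 s2 s1 s3 s1 used later in the text.
S = [2, 2, 1, 3, 1]
M = crt_descriptor_matrix(S, num_symbols=6)
print(M[2, 1])   # 4: four instances of symbol 2 preceding symbol 1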

2.2. Image Analysis

We develop a method for extracting the CRT descriptors from the images by segmenting and scanning the color regions in a set of spatial scans. The overall image analysis and CRT generation process is summarized in Fig. 1. The system segments the images into color regions using color back-projection. Then, the system extracts region strings by scanning the segmented images using a set of scans. Finally, the system consolidates the region strings into the set of composite region templates (CRTs). In order to classify the images, the system pools together the CRTs from the training images of each semantic class to construct a CRT library. The system classifies unknown images by decoding their region strings using the entries in the CRT library.

2.3. Color Region Segmentation

The system segments the images into homogeneous color regions using the method of color back-projection, as illustrated in Fig. 2.


FIG. 2. Example of the color region extraction process including the steps of color selection from histogram h[n], back-projection of colors k with h[k] > τ_c (to give I_d^k[x, y]), filtering, labeling, thresholding (to give I_b^k[x, y]), and recomposition (to give I_g[x, y]).


The process extracts the prominent color regions and recomposites them to generate a segmented image. Since, ultimately, we assign a color symbol to each of the regions, we avoid the clustering of colors, as in [30], and instead use a quantized color space. The segmentation process generates a color blob-like representation of the images, as illustrated in Fig. 3.

The segmentation process initially palettizes each image I_rgb[x, y] using a 166-color quantized HSV color space, giving I_v[x, y], and computes a 166-bin color histogram h[n], as described in [9]. The colors k that are represented with h[k] > τ_c, where we choose τ_c = 0.025, are selected from the histogram in order of prominence. Each selected color k is then back-projected to create a back-projected image I_d^k[x, y] that indicates the pixels in the image that have the selected color k,

I_d^k[x, y] = \begin{cases} 1, & \text{if } I_v[x, y] = k, \\ 0, & \text{otherwise}. \end{cases}    (1)

Each back-projected image I_d^k[x, y] is then filtered to reduce noise and connect nearby regions. The resulting image is then intensity thresholded using a threshold τ_s = 0.9 to generate a binary image. The surviving regions are then labeled and size thresholded using a size threshold τ_z = 0.025 XY per extracted region, where XY is the size of the image. This gives a new binary image I_b^k[x, y] that gives the regions in the original image I[x, y] that have relative size > τ_z and have color k. We then color the regions of I_b^k[x, y] with color k to give I_q^k[x, y]. After repeating this region extraction process for each k where h[k] > τ_c, the colored images I_q^k[x, y] are composited together to generate the color region segmented image I_g[x, y].
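The following sketch strings these steps together for an already palettized image. It is our reading of the pipeline rather than the authors' code: the use of scipy.ndimage for the filtering and labeling steps, the closing operation standing in for the unspecified noise filter, and the toy palette indices are all assumptions.

import numpy as np
from scipy import ndimage

def segment_color_regions(iv, num_colors=166, tau_c=0.025, tau_z=0.025):
    # iv is an already palettized image of color indices (H x W).
    h, w = iv.shape
    hist = np.bincount(iv.ravel(), minlength=num_colors) / float(h * w)
    ig = np.full((h, w), -1, dtype=int)           # composited result; -1 = unassigned
    for k in np.argsort(hist)[::-1]:              # colors in order of prominence
        if hist[k] <= tau_c:
            break                                 # remaining colors fall below tau_c
        mask = (iv == k)                          # back-projection of color k
        mask = ndimage.binary_closing(mask)       # stand-in for the filtering step
        labels, n = ndimage.label(mask)
        for lbl in range(1, n + 1):
            region = (labels == lbl)
            if region.sum() > tau_z * h * w:      # size threshold tau_z * XY
                ig[region] = k                    # recomposite the surviving region
    return ig

# Tiny synthetic example: a 20 x 20 palette-index image with two color blocks.
iv = np.zeros((20, 20), dtype=int)
iv[:10, :] = 5                                    # a "sky" color index
iv[10:, :] = 40                                   # a "grass" color index
print(np.unique(segment_color_regions(iv)))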

2.4. Region Strings

We use region strings to capture the relative locations of the color regions. The region strings are generated in a series of vertical scans of the image that order the segmented regions. Note that this process does not use bounding rectangle representations of the segmented regions; rather, the color-segmented images are scanned directly. We use five vertical scans that are equally spaced horizontally, as illustrated in Fig. 1. Although it is possible to use any number of scans, five scans provide a good sampling of the color region content of most color photographs. We use top-to-bottom scans in order to capture the vertical ordering of color regions. In many photographs, the vertical order of color regions provides a better characterization of the content. For example, in most cases, flipping an image horizontally does not greatly alter the semantics of the image; i.e., its class, such as nature, sunset, or beach, stays the same. However, flipping the image vertically does greatly perturb the perception of the image content.


FIG. 3. Examples of color region segmented images from 10 semantic classes: sunsets, beaches, horses, nature, faces, tigers, houses, crabs, divers, and silhouettes.


Although we consider only color information here, in general we represent the attribute of interest of each region by its entry in the visual feature library, as follows: we denote by a symbol s_k each entry k in the visual feature library. For example, considering the visual feature library D = {red, blue, green, black, white}, we have s_0 = red, s_1 = blue, s_2 = green, and so forth. We define the region strings, which correspond to a series of symbols, as follows.

DEFINITION 1 (Region string). A region string S defines a series of N symbols S = s[0] s[1] s[2] ··· s[N−1] that gives the order of regions in a scan of an image, where s[n] is the symbol value (i.e., color index value) of the nth successive region in the spatial scan.

In each scan, the symbol value of each successive region is concatenated onto the scan's region string. Figure 4 illustrates the region string generation process for two example nature images. In these examples, each symbol value (e.g., symbol "A" in Fig. 4b) represents the color index value, k ∈ {0, ..., 165}, of the region's color in the 166-color HSV color space. The top-to-bottom scans capture the vertical positions of the regions, as illustrated in Fig. 4c.
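A minimal sketch of the scanning step is shown below, assuming a segmented image of palette indices in which -1 marks unassigned pixels; the choice of interior columns and the toy color indices are ours, not details from the paper.

import numpy as np

def region_strings(ig, num_scans=5):
    # Scan equally spaced interior columns top to bottom, appending a symbol
    # each time the scan enters a labeled region with a new color index.
    h, w = ig.shape
    cols = np.linspace(0, w - 1, num_scans + 2, dtype=int)[1:-1]
    strings = []
    for c in cols:
        s, prev = [], -1
        for y in range(h):
            k = int(ig[y, c])
            if k >= 0 and k != prev:
                s.append(k)
            prev = k
        strings.append(s)
    return strings

# Toy segmented image: a sky region (index 5) above a grass region (index 40).
ig = np.full((20, 20), -1, dtype=int)
ig[1:10, :] = 5
ig[10:19, :] = 40
print(region_strings(ig))   # five scans, each reading [5, 40]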

Notice that in the images in Fig. 4a, the regions corresponding to skies and clouds appear above the regions corresponding to grass and trees.


FIG. 4. Example of region string extraction and CRT consolidation: (a) two nature images; (b) color segmented regions; (c) region scans and extracted region strings {S_k}; and (d) CRT descriptor matrix M that gives the count of each CRT, T = t_0 t_1, in the region strings.

These orderings are reflected in the region strings in Fig. 4c: the symbols "B," "C," and "D" (skies and clouds) precede the symbols "F" and "G" (grass and trees) in the top-to-bottom scans. However, the region strings themselves are difficult to use directly to compare images because of the need to deal with insertions, substitutions, and deletions of symbols.

2.5. Composite Region Templates (CRTs)

We simplify the process of comparing the images by using composite region templates (CRTs). The system generates the CRTs by consolidating the region strings, as illustrated in Fig. 4d. Whereas the region strings describe a series of successive symbols, the CRTs describe only the relative ordering of symbols, as follows.

DEFINITION 2 (Composite region template (CRT)). A composite region template (CRT), T, defines a relative ordering of L symbols, T = t[0] t[1] ··· t[L−1], where T[i] precedes T[i + j] in a symbol string S for j > 0.

The CRTs represent instances of symbols preceding each other in a region string or set of region strings. For example, the CRT T = s_0 s_3 s_5 indicates that there is an instance of s_0 preceding s_3 preceding s_5, such as in the symbol string S = s_0 s_2 s_1 s_3 s_6 s_5.

The CRTs can be used to describe the overall color–spatial content of the images statistically by counting the frequencies of the CRTs in the set of region strings.


For example, given the visual feature library D = {red, blue, green, black, white, yellow}, the CRT T = s_2 s_1 corresponds to the template for green preceding blue. We can count the instances I(S, T) of each CRT, T, in each symbol string S. For example, in the symbol string S = s_2 s_2 s_1 s_3 s_1, we have I(S, T) = 4, which indicates that there are four instances of s_2 preceding s_1 (green preceding blue).

In the case that L = 2, the test for T = t_0 t_1 in a length-N region string S is given by the indicator function I(T, S), where

I(T, S) = \sum_{n=0}^{N-1} \sum_{m=n+1}^{N-1} \begin{cases} 1, & \text{if } s[n] = t[0] \text{ and } s[m] = t[1], \\ 0, & \text{otherwise}. \end{cases}    (2)

In general, the CRTs can have L > 2 dimensions, in which case the test for an L-dimensional CRT, T, in a length-N symbol string S is given by

I(T, S) = \sum_{m_0=0}^{N-1} \sum_{m_1=m_0+1}^{N-1} \cdots \sum_{m_{L-1}=m_{L-2}+1}^{N-1} \begin{cases} 1, & \text{if } s[m_0] = t[0],\ s[m_1] = t[1],\ \ldots,\ s[m_{L-1}] = t[L-1], \\ 0, & \text{otherwise}. \end{cases}    (3)
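Evaluating Eq. (3) with literal nested sums grows quickly with L. The sketch below (our own illustration) counts the same ordered subsequences with a standard dynamic-programming recurrence and reproduces the two worked examples from the text.

def crt_instances(T, S):
    # Count occurrences of the template T = (t0, ..., t_{L-1}) as an ordered,
    # not necessarily contiguous, subsequence of the region string S.
    counts = [0] * (len(T) + 1)
    counts[0] = 1
    for symbol in S:
        for i in range(len(T) - 1, -1, -1):   # walk backwards so each symbol
            if symbol == T[i]:                # extends earlier prefixes only once
                counts[i + 1] += counts[i]
    return counts[len(T)]

print(crt_instances((2, 1), [2, 2, 1, 3, 1]))        # 4, as in the text
print(crt_instances((0, 3, 5), [0, 2, 1, 3, 6, 5]))  # 1, the s0 s3 s5 example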

2.6. CRT Descriptor Matrices

The CRT test forms the basis for generating a CRT descriptor matrix, which summarizes all of the counts of the CRTs T_i in the symbol string(s) (S or {S_k}).

DEFINITION 3 (Composite region template (CRT) descriptor matrix). A composite region template (CRT) descriptor matrix M[i, j] of a symbol string S gives the count I(S, s_i s_j) of each L = 2 dimensional CRT, T = s_i s_j, in S.

2.7. Composite Region Template (CRT) Robustness

The CRT descriptor matrices simplify the problem of comparing images since they use only the relative order of symbols in the region strings. They are not as sensitive to the insertions, substitutions, and deletions of symbols, as illustrated in Fig. 4. For example, if symbols are inserted or deleted from the region strings, the CRTs that correspond to the two endpoint symbols remain intact. Potentially, many noisy CRTs are generated within a class of images that correspond to minor differences in the features and positions of regions. However, many of the important CRTs remain dominant for each class of images. This is illustrated for the CRTs in Fig. 4d that are generated from the nature images, as described next.

The matrix in Fig. 4d shows the CRT values for the two nature images. The value of each CRT, T = t_0 t_1, where t_0, t_1 ∈ {A, B, C, D, E, F, G, H, I}, is given by the entry in the matrix M[t_0, t_1]. Although the two nature images have minor differences, and the region strings differ, some of the CRTs, i.e., "BF," "BG," "CF," "CG," and "DF," are prevalent in both. Regardless of the minor differences between the images in this class, these CRTs are found to occur with high likelihood.


2.8. Composite Region Template (CRT) Library

Given a set of semantic classes of images, the system constructs a CRT library by pooling together the CRTs in each semantic class, where for each class, the CRT library gives the likelihood that each CRT is found in that class. The pooling process is carried out by computing the frequency P(T_i) of each CRT, T_i, in a set of region strings {S_j} as

P(T_i) = \sum_j I(T_i, S_j).    (4)

The frequency of each CRT, T_i, in the set of region strings {S_j}_k from semantic class C_k is given by P(T_i | C_k), where

P(T_i | C_k) = \sum_{j : S_j \in C_k} I(T_i, S_j).    (5)

The pooled CRTs form a CRT library that holds the P(T_i | C_k) values for each class k, as follows.

DEFINITION 4 (CRT library). A composite region template (CRT) library is given by the set of tuples

[T_i, P(T_i); P(T_i | C_0), P(T_i | C_1), ..., P(T_i | C_{K−1})],

where K is the number of semantic classes.

The P(T_i)'s reflect the frequencies of the CRTs in all classes (Eq. (4)); the P(T_i | C_k)'s reflect the frequencies within each class k (Eq. (5)).
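One plausible way to carry out this pooling for L = 2 CRTs is sketched below; the plain Counter-of-pairs data structures, the class names, and the symbol indices are our own choices for illustration, not structures specified in the paper.

from collections import Counter
from itertools import combinations

def pairwise_crts(region_string):
    # All L = 2 CRT instances (ordered symbol pairs) in one region string.
    return Counter((a, b) for a, b in combinations(region_string, 2))

def build_crt_library(training_strings_by_class):
    # Pool CRT counts per semantic class and overall, mirroring Eqs. (4) and (5).
    per_class, overall = {}, Counter()
    for cls, strings in training_strings_by_class.items():
        pooled = Counter()
        for s in strings:
            pooled += pairwise_crts(s)
        per_class[cls] = pooled
        overall += pooled
    return overall, per_class

# Hypothetical training strings; the symbol indices and class names are made up.
overall, per_class = build_crt_library({
    "beach": [[5, 30, 12], [5, 30]],
    "nature": [[5, 40], [5, 40, 40]],
})
print(per_class["nature"][(5, 40)], overall[(5, 40)])   # 3 3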

3. IMAGE CLASSIFICATION AND QUERY

The system uses the region strings and the CRT library for classifying and retrieving the images. In applications involving classification, the system assigns each image to the closest semantic class in the CRT library, as described below. In image retrieval applications, the system retrieves the most similar images to the query image.

3.1. Image Classification

The system assigns an image to class l based on its {T'_i}'s as follows:

1. For each T'_i from the image, P(C_k | T'_i) is computed from the entries in the CRT library as

P(C_k | T'_i) = P(T'_i | C_k) / P(T'_i).    (6)

2. The system classifies the image into class l when

\sum_i P(C_l | T'_i) \geq \sum_i P(C_k | T'_i), \quad \forall k \neq l.    (7)


TABLE 1
Results of the Image Semantics Classification Experiment with 10 Semantic Classes Using the CRT Method Compared to Color Histogram-Based Classification

                                    Overall  Beach  Buildings  Crabs  Divers  Faces  Horses  Nature  Silhouettes  Sunsets  Tigers
Total # images                         357     14       56        9      33     55      26      46           41       46      31
# training images                       91      7       10        4      10     10      10      10           10       10      10
# test images                          266      7       46        5      23     45      16      36           31       36      21
# correctly classified (CRTs)          188      6       30        5      23     19      14      20           20       31      21
# correctly classified (colorhist)     179      5       19        5      23     16      13      24           31       24      19
% correctly classified (CRTs)         70.7   85.7     65.2      100     100   42.2    87.5    55.6         64.5     86.1     100
% correctly classified (colorhist)    67.3   71.4     41.3      100     100   35.5    81.3    66.7          100     66.7    90.5

That is, class C_l best explains the CRTs represented in the spatial order of regions in the unknown image.
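A hedged sketch of this decision rule, reusing pooled per-class and overall CRT counts of the kind held in the library, is shown below; the class names and CRT counts are invented for the example.

from collections import Counter

def classify(image_crts, overall, per_class):
    # Score each class by sum_i P(C_k | T'_i), with
    # P(C_k | T'_i) = P(T'_i | C_k) / P(T'_i), as in Eqs. (6) and (7).
    scores = {}
    for cls, pooled in per_class.items():
        scores[cls] = sum(pooled[t] / overall[t] for t in image_crts if overall[t] > 0)
    return max(scores, key=scores.get), scores

# Hypothetical pooled CRT counts for two classes; symbol indices are made up.
per_class = {
    "beach": Counter({(5, 30): 4, (30, 12): 3}),
    "nature": Counter({(5, 40): 5, (40, 40): 2}),
}
overall = sum(per_class.values(), Counter())

label, scores = classify([(5, 40), (40, 40)], overall, per_class)
print(label)   # "nature"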

3.2. Image Query

In image search applications, the images are queried based on spatial sequences of color regions. The system returns the top-ranked images that are most similar to the query image. We consider that each target image has a region string S_t, or CRT descriptor matrix M_t. The query can be formed from the query image by extracting its set of region strings {S_j}_q and computing the descriptor matrix M_q, as described above. Alternatively, the user can sketch a query image by positioning color regions on a grid [9]. The system can then compute M_q based on the region layout.

In order to compare the images based on spatial information, the system computes the similarity of M_q to those of the target images M_t as

\theta_{q,t} = \sum_i \frac{1}{M_q[i, i]\, M_t[i, i]} \sum_j M_q[i, j]\, M_t[i, j].    (8)

The query system retrieves the images with the highest θ_{q,t}.
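A small sketch of the matching step follows. It implements our reading of Eq. (8), with the diagonal entries acting as a per-row normalization and a small epsilon guarding rows with zero diagonal counts; the toy matrices and image names are illustrative only.

import numpy as np

def crt_similarity(Mq, Mt, eps=1e-9):
    # Per-row cross-correlation of the two CRT matrices, normalized by the
    # diagonal terms; eps guards rows with zero diagonal counts.
    diag = Mq.diagonal() * Mt.diagonal()
    return ((Mq * Mt).sum(axis=1) / (diag + eps)).sum()

def rank_targets(Mq, targets):
    # Return target ids sorted by decreasing similarity to the query matrix.
    return sorted(targets, key=lambda t: crt_similarity(Mq, targets[t]), reverse=True)

# Toy 3 x 3 CRT matrices for a query and two hypothetical database images.
Mq = np.array([[1, 4, 0], [0, 1, 2], [0, 0, 1]], dtype=float)
targets = {
    "img_a": np.array([[1, 3, 0], [0, 1, 2], [0, 0, 1]], dtype=float),
    "img_b": np.array([[1, 0, 0], [0, 1, 0], [2, 0, 1]], dtype=float),
}
print(rank_targets(Mq, targets))   # ['img_a', 'img_b']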

4. EVALUATION

We evaluate the CRT method by evaluating its performance in classifying 266 unknown images from 10 semantic classes and in retrieving images from a collection of 893 color photographs. We also demonstrate the CRT method in retrieving images using graphical querying.

4.1. Experimental Setup

The experiments used color images obtained from Expert Software.¹ We used 357 images from the image collection that belong to 10 semantic classes. For the image retrieval experiments we used an additional 536 images that belong outside of these 10 semantic classes. We divided the 357 semantic images into nonoverlapping training and test sets according to Table 1.

1 Expert Software, Inc., 800 Douglas Rd., Coral Gables, FL 33134.


We used 91 training images to generate the CRT library and the remaining 266 images for testing the system.

We also performed the classification and retrieval using 166-bin color histograms defined in HSV color space in order to provide a comparison [9]. Using the histograms, we represented each semantic class by the centroid of the color histograms of the training images in the class.

4.2. Semantic Classification Results

The classification results are summarized in Table 1. Overall, the semantics decoding system using CRTs provided a classification rate of 0.71. The color histogram performance was slightly lower at 0.67. However, for most of the image classes, the CRT method performed better than color histograms. In particular, the CRT method was significantly better at classifying the images of beaches, buildings, and sunsets. For the silhouette images, the CRTs performed worse than the color histograms. In this case the dominant property of the silhouette images, a large black background, was not captured well by the CRTs.

The confusion matrix for the semantics classification system is given in Table 2. We can see that, with the exception of the faces and nature images, few misclassifications resulted. The faces images were classified correctly only 42.2% of the time and were misclassified as horses 24.4% of the time. The faces class was the most challenging because the face regions were often only a small part of the image and, overall, the images had a large variety of backgrounds. The nature images were confused with beaches 19.4% of the time because of the high similarity of the scenes.

4.3. Retrieval Effectiveness Results

We used the full set of 893 images to evaluate the retrieval effectiveness of the CRT method. We analyzed three image queries: sunset images, nature images, and diver images. For each benchmark query, we preassigned each target image n a subjective relevance V_n ∈ {0, 1} to the query as follows: the images in the same class as the query image were assigned a relevance V_n = 1, and the remaining images were assigned a relevance V_n = 0. Each of the relevant images was used in turn to query the database of N = 893 images. Each query retrieved the target images in rank order (a total ordering of the N images).

TABLE 2
The Confusion Matrix Results of the Image Semantics Classification Experiment with 10 Semantic Classes

Label →        Beach  Buildings  Crabs  Divers  Faces  Horses  Nature  Silhouettes  Sunsets  Tigers
Beach              6          0      0       0      0       0       1            0        0       0
Buildings          6         30      0       2      0       1       1            0        4       2
Crabs              0          0      5       0      0       0       0            0        0       0
Divers             0          0      0      23      0       0       0            0        0       0
Faces              5          1      0       0     19      11       1            0        5       3
Horses             0          0      0       0      0      14       0            0        0       2
Nature             7          3      0       0      0       3      20            0        1       2
Silhouettes        0          1      0       0      1       1       6           20        1       1
Sunsets            0          0      0       0      3       1       0            0       31       1
Tigers             0          1      0       0      0       0       0            0        0      20


Based on the set of queries using each of the images in a class, we computed the average retrieval effectiveness in terms of average precision and recall. For the retrieved image from query j with rank k, we count the number of detections, false alarms, and misses as

• A_k^j = \sum_{i=0}^{k-1} V_i (detections)
• B_k^j = \sum_{i=0}^{k-1} (1 − V_i) (false alarms)
• C_k^j = \sum_{i=0}^{N-1} V_i − A_k^j (misses)

and we compute the recall and precision for query j as

• recall: R_k^j = A_k^j / (A_k^j + C_k^j)
• precision: P_k^j = A_k^j / (A_k^j + B_k^j).

We then computed the average recall and precision over the queries within the same class as

• average recall: R_k = \sum_j R_k^j
• average precision: P_k = \sum_j P_k^j.
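For reference, the short sketch below computes the per-query precision and recall at rank k directly from a list of relevance labels; the example relevance values are made up.

def precision_recall_at_k(V, k):
    # V lists the relevance labels V_i in {0, 1} of the ranked results.
    detections = sum(V[:k])                    # A_k
    false_alarms = k - detections              # B_k
    misses = sum(V) - detections               # C_k
    recall = detections / (detections + misses) if detections + misses else 0.0
    precision = detections / k if k else 0.0
    return precision, recall

# Hypothetical ranked relevance judgments for one query (1 = same class).
V = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_recall_at_k(V, 5))   # (0.6, 0.75): 3 of 5 retrieved, 3 of 4 relevant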

We repeated the query experiments using CRTs, color histograms, and texture. We used color histograms that have 166 bins in HSV color space, as defined in [9]. We used a global texture descriptor based on a nine-dimensional vector that gives the spatial-frequency energy in nine wavelet subbands of the image, as defined in [31]. The results are given as follows:

Sunsets. In the sunset image queries (46 queries), the CRT method showed better average retrieval effectiveness than the color histogram and texture methods. For example, in order to obtain half (23) of the sunset images, 61 images needed to be retrieved using CRTs, compared to 118 using color histograms and 333 for texture.

Divers. In the diver image queries (33 queries), the CRT method also showed better average retrieval effectiveness. Using the CRT method, 91% (30) of the diver images were retrieved with a precision of 0.88 (34 retrieved images). Using color histograms, 91% (30) of the diver images were retrieved with a precision of 0.57 (52 retrieved images).


Nature. The nature image queries (46 queries) were the most interesting because the CRT method performed better even though color histograms were better at classifying the nature images (see Table 1). Using the CRT method, in the first 20 retrieved images, on average, 8.2 were nature images, compared to 6.1 for color histograms. Figure 5 plots the average retrieval effectiveness in terms of precision vs recall for the nature queries. On average, after the first two returned images, the CRT method gave higher precision than the color histogram and texture-based methods for the same value of recall.

4.4. Graphical Query

The CRT method allows fast matching of images in graphical querying, as illustrated in Fig. 6. The user places the color regions on a grid to construct a coarse depiction of the scenes of interest. The set of query CRTs is obtained by scanning the query grid and generating and consolidating the region strings for the query. The matching is then performed directly between the query CRTs and those for the target images using Eq. (8).

FIG. 5. Average retrieval effectiveness of 46 queries for nature images using three methods (CRTs, color histograms, and texture). The points marked by "o" give the average precision P_k and recall R_k for the kth match, where k = 20.


FIG. 6. Example graphical image queries in which the CRT descriptors are generated from the color region layout obtained from the color regions placed on a query grid.


For example, in the first query in Fig. 6, the user places two regions, one for blue sky and another for a green field, on the query grid. The retrieved images have the best match to the query in terms of the composition of the color regions. The best six matches are illustrated on the right in Fig. 6. In the next example query, the user places a pale blue region above a brown region. The scenes in the retrieved images match the query in terms of color and location of the regions. In the last example query, the user places a blue region above a tan region. The first five retrieved images are ones in which the blue region matches to sky and the tan region matches to sand.
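One way such a sketched grid could be turned into a query CRT descriptor matrix is outlined below; the grid layout, the symbol indices, and the collapsing of repeated cells into a single region are illustrative assumptions of ours.

import numpy as np
from itertools import combinations

def crts_from_grid(grid, num_symbols=166):
    # Turn a coarse sketch grid of color indices (-1 = empty cell) into an
    # L = 2 CRT descriptor matrix by scanning each column top to bottom,
    # collapsing runs of identical cells into a single region.
    M = np.zeros((num_symbols, num_symbols), dtype=int)
    for col in np.asarray(grid).T:
        s = []
        for k in col:
            if k >= 0 and (not s or s[-1] != k):
                s.append(int(k))
        for a, b in combinations(s, 2):
            M[a, b] += 1
    return M

# Query sketch: blue sky (index 5) above a green field (index 40) on a 3 x 3 grid.
grid = [[ 5,  5,  5],
        [40, 40, 40],
        [-1, 40, -1]]
Mq = crts_from_grid(grid)
print(Mq[5, 40], Mq[40, 5])   # 3 0: sky precedes field in every scan, never the reverse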

5. SUMMARY

We presented a method for classifying and retrieving images using composite region templates (CRTs) generated from automatically extracted strings of color regions. The system extracts the region strings by scanning the segmented color regions in a series of scans. Images are matched by consolidating the region strings into CRT descriptor matrices and comparing them. We demonstrated that the system performs well in classifying and retrieving images from 10 semantic classes.

REFERENCES

1. J. R. Smith and S.-F. Chang, Visually searching the Web for content, IEEE Multimedia Mag. 4(3), 1997, 12–20.

2. S. Sclaroff, L. Taycher, and M. La Cascia, ImageRover: A content-based image browser for the World Wide Web, in Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.

3. V. Athitsos, M. J. Swain, and C. Frankel, Distinguishing photographs and graphics on the World-Wide Web, in Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.

4. S.-S. Chen, Content-based indexing of spatial objects in digital libraries, J. Visual Commun. Image Rep. 7(1), 1996, 16–27.

5. W. Wolf, Y. Liang, M. Kozuch, H. Yu, M. Phillips, M. Weekes, and A. Debruyne, A digital video library on the World Wide Web, in Proc. ACM Int. Conf. Multimedia (ACMMM), November 1996, pp. 433–434.

6. J. R. Smith, Digital video libraries and the Internet, IEEE Commun. Mag. 37(1), 1999, 92–99. [Special issue on the Next Generation Internet]

7. J. R. Smith, R. Mohan, and C.-S. Li, Content-based transcoding of images in the Internet, in IEEE Proc. Int. Conf. Image Processing (ICIP), Chicago, IL, October 1998.

8. J. R. Smith, R. Mohan, and C.-S. Li, Transcoding Internet content for heterogeneous client devices, in Proc. IEEE Inter. Symp. on Circuits and Syst. (ISCAS), June 1998. [Special session on Next Generation Internet]


9. J. R. Smith and S.-F. Chang, VisualSEEk: A fully automated content-based image query system, in Proc. ACM Intern. Conf. Multimedia (ACMMM), Boston, MA, November 1996, pp. 87–98.

10. J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain, and C. Shu, Virage image search engine: An open framework for image management, in Symposium on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE, Vol. 2670, pp. 76–87, January 1996.

11. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, Query by image and video content: The QBIC system, IEEE Comput. 28(9), 1995, 23–32.

12. V. E. Ogle and M. Stonebraker, Chabot: Retrieval from a relational database of images, IEEE Computer 28(9), 1995, 40–48.

13. A. Pentland, R. W. Picard, and S. Sclaroff, Photobook: Tools for content-based manipulation of image databases, in Storage and Retrieval for Still Image and Video Databases II, IS&T/SPIE, Vol. Proc. SPIE 2185, February 1994. [MIT TR #255]

14. J. R. Smith and S.-F. Chang, Integrated spatial and feature image query, Multimedia Systems 7(2), 1999, 129–140.

15. J. Huang and R. Zabih, Combining color and spatial information for content-based image retrieval, in European Conf. on Digital Libraries, September 1998.

16. A. Del Bimbo and E. Vicario, Using weighted spatial relationships in retrieval by visual content, in Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, June 1998.

17. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.

18. J. R. Smith and S.-F. Chang, Local color and texture extraction and spatial query, in IEEE Proc. Int. Conf. Image Processing (ICIP), Lausanne, Switzerland, September 1996.

19. J. R. Smith and Chung-Sheng Li, Decoding image semantics using composite region templates, in IEEE CVPR-98 Workshop on Content-based Access to Image and Video Databases, June 1998.


20. S.-K. Chang, Q. Y. Shi, and C. Y. Yan, Iconic indexing by 2-D strings, IEEE Trans. Pattern Anal. Mach. Intell. 9(3), 1987, 413–428.

21. C. C. Chang and S. Y. Lee, Retrieval of similar pictures on pictorial databases, Pattern Recog. 24(7), 1991, 21–22.

22. V. N. Gudivada and V. V. Raghavan, Design and evaluation of algorithms for image retrieval by spatial similarity, ACM Trans. Inform. Systems 13(2), 1995.

23. M. Stricker and A. Dimai, Color indexing with weak spatial constraints, in Symposium on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE, Vol. 2670, pp. 29–41, 1996.

24. F. Ennesser and G. Medioni, Finding Waldo, or focus of attention using local color information, IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 1995.

25. T.-S. Chua, S.-K. Lim, and H.-K. Pung, Content-based retrieval of segmented images, in Proc. ACM Intern. Conf. Multimedia (ACMMM), October 1994.

26. M. Szummer and R. W. Picard, Indoor-outdoor image classification, in IEEE Intl. Workshop on Content-based Access of Image and Video Databases, January 1998.

27. T. Caelli and D. Reye, On the classification of image regions by colour, texture and shape, Pattern Recog. 26(4), 1993.

28. C. Carson, S. Belongie, H. Greenspan, and J. Malik, Region-based image querying, in Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.

29. D. A. Forsyth, J. Malik, M. M. Fleck, T. Leung, C. Bregler, C. Carson, and H. Greenspan, Finding pictures of objects in large collections of images, in Proceedings, International Workshop on Object Recognition, IS&T/SPIE, April 1996.

30. S. Tominaga, Color classification of natural color images, COLOR Res. Appl., 1992.

31. J. R. Smith and S.-F. Chang, Transform features for texture classification and discrimination in large image databases, in IEEE Proc. Int. Conf. Image Processing (ICIP), Austin, TX, November 1994, pp. 407–411.