
Augmented Image Histogram for Image and Video Similarity Search*

Y. Chen and E. K. Wong

Department of Computer and Information Science, Polytechnic University, Brooklyn, NY 11201

Abstract

Image histogram is an image feature widely used in content-based image retrieval and video segmentation. It is simple to compute yet very effective as a feature in detecting image-to-image similarity, or frame-to-frame dissimilarity. While the image histogram captures the global distribution of different intensities or colors well, it does not contain any information about the spatial distribution of pixels. In this paper, we propose to incorporate spatial information into the image histogram by computing features from the spatial distances between pixels belonging to the same intensity or color. In addition to the frequency count of the intensity or color, the mean, variance, and entropy of the distances are computed to form an Augmented Image Histogram. Using the new feature, we performed experiments on a set of color images and a color video sequence. Experimental results demonstrate that the Augmented Image Histogram performs significantly better than the conventional color histogram, both in image retrieval and video shot segmentation.

Keywords: content-based image retrieval, video shot segmentation, Augmented Image Histogram, similarity search, spatial information

1. INTRODUCTION

Image histogram is a simple but effective feature for computing similarity between images or video frames in content-based image and video retrieval systems [1, 2, 3]. The image histogram is an N-dimensional vector $\{H(f, i);\ i = 1, 2, \ldots, N\}$, where N is the number of intensity levels or colors, and H(f, i) is the number of pixels having intensity or color i in image f. The image histogram contains information about the global distribution of intensity values or colors in an image or a video frame. The image histogram is a global image feature that is invariant to image rotation, translation, and viewing axis [4].
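For a quantized image, the conventional histogram is simply a per-color pixel count. As a point of reference for what follows, here is a minimal Python sketch (our own illustration, not from the paper; numpy and the function name are assumptions):

```python
import numpy as np

def color_histogram(img, n_colors):
    """Conventional histogram H(f, i): the number of pixels with
    quantized color i in image f, for i = 0 .. n_colors - 1."""
    return np.bincount(img.ravel(), minlength=n_colors)
```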

The image histogram, however, does not contain information about the spatial locations or distributions of pixels in an image. It is possible that two images having different scene content have the same or similar image histograms. This results in false retrievals when performing a similarity search in an image database, or missed segmentations when performing video shot segmentation. One way to include some form of spatial information is by the use of "local histograms." In this approach, an image is divided into smaller overlapping or non-overlapping sub-regions [5][6], and a local histogram is computed in each sub-region. The spatial information obtained using this approach, however, is dependent on how the image is partitioned into smaller sub-regions. The local histograms change with different sizes and locations of the sub-regions. Moreover, local histograms obtained from image sub-regions are sensitive to image rotation

*This work is supported by the National Science Foundation under the STIMULATE program (Grant IIS-9619114).

Further author information - Y. C.: [email protected]; E. K. W. (correspondence): wong@vision.poly.edu

Part of the IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases VII, San Jose, California, January 1999. SPIE Vol. 3656 • 0277-786X/98/$10.00


and translation. For example, when there is a tracking motion of the camera, the global histograms of two consecutive video frames may be similar but the local histograms could be very different, due to the fact that each local window contains a different and separate part of the scene. In [7], a measure called the color coherence vector (CCV) is used to classify pixels as either coherent or incoherent. A coherent pixel is part of a large group of pixels of the same color, while an incoherent pixel is not. Recently, in [8], a feature called the color correlogram is used to express how the spatial correlation of pairs of colors changes with distance. The color correlogram reduces to an autocorrelogram when the pairs of colors are identical. In their approach, the entire autocorrelogram (over all distance values) is used to compute the distance between two images.

In this paper, we propose a new Augmented Image Histogram (or Augmented Histogram) that captures the "spatial distribution" of pixels, in addition to the intensity or color count. Since the spatial distribution is computed globally on the image, it is relatively insensitive to image rotation and translation. The concepts and formulations presented herein apply to both grayscale and color images. In this paper, we focus our analysis and discussions on color histograms, and an Augmented Image Histogram will be referred to as an Augmented Color Histogram (or simply Augmented Histogram) when applied to color images. In Section 2, we present details on the Augmented Histogram. Section 3 gives the experimental results. Finally, in Section 4, we give conclusions and future work.

2. AUGMENTED HISTOGRAM

2.1 Definition

The Augmented Color Histogram H* can be defined as a 4-dimensional vector

$$H^*(f, i) = \big(C(i), M(i), V(i), E(i)\big), \quad i = 1, \ldots, N \qquad (1)$$

where C(i) is the number of pixels with color i; M(i), V(i), and E(i) are measures computed from the spatial distances of pixels; and N is the total number of colors. The spatial distance used could be either the Euclidean distance or the city-block distance, with the latter being more computationally efficient. Let dist(p_k, p_l) be the spatial distance from pixel p_k to pixel p_l. Throughout this paper, we will use the city-block distance, defined as

$$\mathrm{dist}_{cb}(p_k, p_l) = |x_k - x_l| + |y_k - y_l|. \qquad (2)$$

The mean distance from pixel p_k to all pixels of color i is defined as

$$m(i, k) = \frac{1}{C(i) - 1} \sum_{l=1}^{C(i)} \mathrm{dist}_{cb}(p_k, p_l) \qquad (3)$$

where C(i) represents the total number of pixels of color i. The mean distance m(i, k) could be considered to be a feature of pixel p_k that characterizes how far p_k is from the rest of the pixels with color i. M(i) is then defined as the average of the mean distance over all pixels of color i; that is,

$$M(i) = \frac{1}{C(i)} \sum_{k=1}^{C(i)} m(i, k)
       = \frac{1}{C(i)\,(C(i)-1)} \sum_{k=1}^{C(i)} \sum_{l=1}^{C(i)} \mathrm{dist}_{cb}(p_k, p_l)
       = \frac{2}{C(i)\,(C(i)-1)} \sum_{k=1}^{C(i)} \sum_{l=k+1}^{C(i)} \mathrm{dist}_{cb}(p_k, p_l) \qquad (4)$$

The variance of the distance from pixel p_k to other pixels of color i is defined as

$$v(i, k) = \frac{1}{C(i) - 1} \sum_{l=1}^{C(i)} \big(\mathrm{dist}_{cb}(p_k, p_l) - m(i, k)\big)^2. \qquad (5)$$

Then the average of the variance of distance, V(i), over all pixels p_k of color i can be computed as

$$V(i) = \frac{1}{C(i)} \sum_{k=1}^{C(i)} v(i, k)
       = \frac{1}{C(i)\,(C(i)-1)} \sum_{k=1}^{C(i)} \sum_{l=1}^{C(i)} \big(\mathrm{dist}_{cb}(p_k, p_l) - m(i, k)\big)^2 \qquad (6)$$

A low value of M(i) and V(i) indicates that pixels of color i are concentrated in a local region of the image, as found in an image having local regions of uniform color. On the other hand, a high value of M(i) and V(i) indicates that pixels belonging to color i are more randomly distributed over the entire image space, as in the case of a scene with many textured regions.

The distance entropy E(i) of color i is defined as

$$E(i) = -\sum_{d} p(i, d) \log p(i, d) \qquad (7)$$

where p(i, d) is the probability of occurrence of distance d from pixel p_k to other pixels of the same color, computed among all pixels p_k; that is,

$$p(i, d) = \frac{f(d)}{C(i)\,(C(i) - 1)} \qquad (8)$$

where f(d) is the frequency of distance d. The distance entropy represents the randomness of the distance values. It is at a maximum when all non-zero distance values have equal probability.

The Augmented Histogram therefore captures the spatial distribution of pixels of color i relative to each other, instead of containing information about the absolute spatial locations of the pixels. The distinguishing ability of the Augmented Histogram is illustrated in Figure 1, which shows different spatial distributions of nine pixels of the same color. The images are of size 17 × 17 and each pixel is represented by a black square block. The augmented histogram values (C(i), M(i), V(i), E(i)) are computed and shown in parentheses below each image. The distance histogram, which shows the frequency of occurrence of distance values, is shown for each image in Figure 2. Note that the mean distances of Figures 1(a) and 1(b) are very close to each other, yet their variances and entropies are quite different. The image in Figure 1(c) has the smallest distance mean and variance, as all the pixels are located in a local region. Also note that the images in Figures 1(a) and (c) have the same entropy value. This is because they have similar distance histograms, as shown in Figures 2(a) and (c).
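To make the definitions concrete, the following Python sketch computes the Augmented Histogram of a quantized image directly from Equations (1)-(8). It is our own illustrative code, not the authors' implementation; the function name and the use of numpy are assumptions, and the self-pair (k = l) terms are handled exactly as the printed sums suggest.

```python
import numpy as np

def augmented_histogram(img, n_colors):
    """Compute H*(f, i) = (C(i), M(i), V(i), E(i)) per Equations (1)-(8).

    img: 2-D array of quantized color indices in [0, n_colors).
    Memory and time grow with the square of the largest color count,
    which is exactly the O((P x Q)^2) worst case that motivates the
    block-based approach of Section 2.2.
    """
    C = np.zeros(n_colors)
    M = np.zeros(n_colors)
    V = np.zeros(n_colors)
    E = np.zeros(n_colors)
    for i in range(n_colors):
        ys, xs = np.nonzero(img == i)          # pixel coordinates of color i
        c = len(xs)
        C[i] = c
        if c < 2:
            continue                           # distances undefined for < 2 pixels
        # City-block distance between every pixel pair, Eq. (2).
        d = np.abs(xs[:, None] - xs[None, :]) + np.abs(ys[:, None] - ys[None, :])
        # Mean distance from each pixel to the rest, Eq. (3); the k = l
        # term contributes zero, so summing whole rows is harmless.
        m = d.sum(axis=1) / (c - 1)
        M[i] = m.mean()                        # Eq. (4)
        # Per-pixel distance variance, Eq. (5), averaged as in Eq. (6).
        V[i] = (((d - m[:, None]) ** 2).sum(axis=1) / (c - 1)).mean()
        # Distance entropy, Eqs. (7)-(8): frequencies computed over the
        # C(i) * (C(i) - 1) ordered pixel pairs with k != l.
        pair_d = d[~np.eye(c, dtype=bool)]
        _, freq = np.unique(pair_d, return_counts=True)
        p = freq / (c * (c - 1))
        E[i] = -(p * np.log(p)).sum()
    return C, M, V, E
```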

2.2 Block-Based Approach

The direct computation of H* according to the definitions above requires the computation of distances between every pixel p_k and every pixel p_l of color i. This results in a worst-case computational complexity of order O((P × Q)^2) when all pixels are of the same color; here, P × Q is the size of the image. An alternative approach is to approximate M(i) and V(i) by dividing an image into small blocks of size K × L. The distance between each pair of blocks can be pre-computed and stored. To approximate M(i), we first compute

$$\mathrm{dist\_sum}(A, B, i) = \mathrm{count}(A, i) \times \mathrm{count}(B, i) \times \mathrm{dist}(A, B) \qquad (9)$$

for all pairs of blocks A and B, where count(A, i) and count(B, i) represent the number of pixels of color i that are located in blocks A and B, respectively, and dist(A, B) is the pre-computed distance between blocks A and B. The approximation for M(i) is then computed by adding up dist_sum(A, B, i) for all block pairs A and B, then dividing by the total number of pixel pairs. V(i) and E(i) can be similarly approximated. Using this method, the amount of computation is greatly reduced, yet it provides a good approximation to the computation of the Augmented Histogram. The worst-case computational complexity using the block-based approach is O((P/K × Q/L)^2), when all pixels in the image are of the same color.

In computing the distance between two blocks, the center of the block is used as an approximation to the location of all pixels within the block. The resulting error in the computed distance between two pixels can be shown to lie within the interval [-(K+L), K+L], where K and L are the width and height of the blocks. When the number of pixels is large, we expect the negative and positive errors to cancel each other out, so the estimated M(i), V(i), and E(i) will be close to the true values. The experimental results we describe in Section 3 confirm this.
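The block-based approximation of M(i) can be sketched as follows. This is again our own illustrative code under stated assumptions (numpy; hypothetical function name); for simplicity it ignores partial blocks at the image border, whereas the paper's 10 × 7 grid for 192 × 128 images suggests partial border blocks were kept.

```python
import numpy as np

def block_mean_distance(img, n_colors, K=20, L=20):
    """Approximate M(i) for every color via Eq. (9).

    img: 2-D array of quantized color indices; K x L is the block size.
    Returns a length-n_colors array of approximate mean distances.
    """
    P, Q = img.shape                              # P rows (height), Q columns
    nby, nbx = P // L, Q // K                     # full blocks only, in this sketch
    counts = np.zeros((nby * nbx, n_colors))      # count(block, i)
    centers = np.zeros((nby * nbx, 2))            # block centers, (y, x)
    for by in range(nby):
        for bx in range(nbx):
            b = by * nbx + bx
            block = img[by * L:(by + 1) * L, bx * K:(bx + 1) * K]
            counts[b] = np.bincount(block.ravel(), minlength=n_colors)
            centers[b] = (by * L + L / 2.0, bx * K + K / 2.0)
    # Pre-computed city-block distance between all block centers, dist(A, B).
    dist = np.abs(centers[:, None, :] - centers[None, :, :]).sum(axis=-1)
    M = np.zeros(n_colors)
    for i in range(n_colors):
        c = counts[:, i]
        n = c.sum()
        if n < 2:
            continue
        # Sum dist_sum(A, B, i) over all ordered block pairs, Eq. (9), and
        # divide by the n * (n - 1) ordered pixel pairs; same-block pairs
        # contribute zero distance under the center approximation.
        M[i] = (c[:, None] * c[None, :] * dist).sum() / (n * (n - 1))
    return M
```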

2.3 Similarity Measure

In our program implementation, we treat the Augmented Histogram in Equation (1) as consisting of four separate histograms: the pixel frequency histogram C(i), the distance mean histogram M(i), the distance variance histogram V(i), and the distance entropy histogram E(i), each indexed by color i along the horizontal axis. An overall similarity measure can then be defined for the Augmented Color Histogram H* as a linear combination of its four components:

$$s_j = a \cdot sc_j + b \cdot sm_j + c \cdot sv_j + d \cdot se_j \qquad (10)$$

where sc_j, sm_j, sv_j, and se_j are the computed similarities between the query image Q and the j-th image in a database consisting of T images, and a, b, c, and d are the weights assigned to them. When applied to video shot segmentation, the similarity (or dissimilarity) is computed between two successive image frames in a video sequence. There exist in the literature [9] different metrics for computing similarity between color histograms for image and video retrieval. These include: (1) the sum of absolute histogram differences, where we sum the absolute value of the histogram difference over all color values i; and (2) the normalized intersection of histograms, where we sum the minimum of the two histograms over all color values i, then divide by the total number of pixels in the image. We use metric (1) in the computation of

sc_j, sm_j, sv_j, and se_j; that is,

$$sc_j = \sum_i |C_Q(i) - C_j(i)| \qquad (11)$$

$$sm_j = \sum_i |M_Q(i) - M_j(i)| \qquad (12)$$

$$sv_j = \sum_i |V_Q(i) - V_j(i)| \qquad (13)$$

$$se_j = \sum_i |E_Q(i) - E_j(i)| \qquad (14)$$

As in [10], we normalize each similarity sequence using Gaussian normalization by assuming it to be a Gaussian sequence. Suppose we have a similarity sequence $S = \{s_1, s_2, s_3, \ldots, s_T\}$, where T is the total number of images in the database and s_j is any one of the four similarity measures (sc_j, sm_j, sv_j, or se_j); we first compute the mean and standard deviation of the sequence. We then normalize the original sequence by the following equation:

$$s'_j = \frac{1}{2}\left(\frac{s_j - m}{3\sigma} + 1\right) \qquad (15)$$

where m is the mean and σ is the standard deviation. After normalization, the probability of a similarity value being in the range [0, 1] is approximately 99%. Any value outside of the range can be mapped to the value 0 or 1. The normalization process ensures equal emphasis of the four similarity measures in the computation of the overall similarity.

The choice of the weights a, b, c, and d depends on the class of images we are working with; they can be assigned subjectively or determined experimentally using a set of training images. The weights allow us to assign different degrees of importance to the four components.
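As a concrete illustration of Equations (10)-(15), the following sketch scores every database image against a query. It is our own rendering, with hypothetical function and argument names; the (C, M, V, E) tuple representation and the use of numpy are assumptions.

```python
import numpy as np

def overall_similarity(query_feats, db_feats, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four component measures into the overall score of Eq. (10).

    query_feats and each entry of db_feats are (C, M, V, E) tuples of
    per-color histograms. Lower scores mean more similar images, since
    metric (1) is a sum of absolute differences.
    """
    # Eqs. (11)-(14): one L1 difference per component, for each of T images.
    raw = np.array([[np.abs(q - f).sum() for q, f in zip(query_feats, feats)]
                    for feats in db_feats])               # shape (T, 4)
    # Eq. (15): Gaussian normalization of each component sequence, then
    # clipping the ~1% of values that fall outside [0, 1].
    m, sigma = raw.mean(axis=0), raw.std(axis=0)
    norm = np.clip(((raw - m) / (3.0 * sigma) + 1.0) / 2.0, 0.0, 1.0)
    # Eq. (10): weighted linear combination; the experiments use a=b=c=d=1.
    return norm @ np.asarray(weights)

# Usage sketch: rank all database images, best match first.
# scores = overall_similarity(query_feats, db_feats)
# ranking = np.argsort(scores)
```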

3. EXPERIMENTAL RESULTS

We implemented a program for image retrieval based on the proposed Augmented Histogram and applied it to a set of color images. Experimental results show that the Augmented Histogram has better discrimination power than the conventional color histogram. There are test cases where the conventional color histogram fails, but the Augmented Image Histogram succeeds in differentiating the images. We also implemented a program for video shot segmentation based on the proposed Augmented Histogram. Encouraging results were obtained when the program was applied to a color video sequence. We chose a value of 1 for each of the weights a, b, c, and d in Equation (10) for both experiments, and obtained good results with this setting.

3.1 Image Retrieval

A total of 854 color images, each of size 192 × 128, were used in our experiment. These images were selected from 12 different categories of the Corel Photo Collection [11]. Each color image is quantized into 64 colors. Since spatial features are used in our method, we selected only the horizontally oriented images from each category for consistency. We also split the category "bear" into "brown bear" and "polar bear", resulting in a total of 13 different categories, as shown in Table 1. Column 2 of the table lists the number of images in each category. The augmented histogram for each image is computed offline and stored for later similarity computation during the search process. In our experiment, each image in the database is used as a query image and tested against the rest of the database. We performed experiments using both the Augmented Histogram and the conventional color histogram. In computing pixel-to-pixel distances, we used both the direct computation method based on the definitions in Section 2.1 and the block-based method described in Section 2.2. The block size used in the block-based approach is 20 × 20, resulting in a total of 10 × 7 blocks for the whole image. The precision and recall for different scope values are then computed and tabulated. Here, precision is defined as the total number of retrieved relevant images divided by the total number of retrieved images (the scope); recall is defined as the total number of retrieved relevant images divided by the total number of relevant images [12].
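For reference, precision and recall at a given scope can be computed as below; this is a straightforward rendering of the definitions in [12], with hypothetical argument names.

```python
def precision_recall_at_scope(ranked_labels, query_label, n_relevant, scope=5):
    """Precision and recall for one query at the given scope.

    ranked_labels: category labels of the database images, best match
    first and excluding the query itself; n_relevant: the number of
    relevant images (those in the query's own category).
    """
    hits = sum(1 for label in ranked_labels[:scope] if label == query_label)
    return hits / scope, hits / n_relevant
```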

Table 1 lists the results for a scope value of 5 using the direct distance computation method and the block-based distance computation method. As shown in the table, the precision and recall obtained using the Augmented Histogram (with both the direct and block-based methods) improve significantly over those obtained using the conventional color histogram for all categories. Figure 3 depicts a plot of average precision vs. scope, computed over all categories. As shown in the figure, the Augmented Histogram outperforms the conventional color histogram for all values of scope. For the direct distance computation method, the improvement ranges from 16.27% to 26.51% for scope values ranging from 1 to 20. For the block-based method, the improvement ranges from 20.24% to 25.86% over the same range of scope values. The figure also shows that the block-based distance computation method compares favorably with the direct approach, with the two curves almost coinciding. Figure 4 shows two retrieval examples where the Augmented Histogram outperforms the conventional color histogram. On the left of the figure are the query images. On the right are the retrieved relevant images that are ranked highly when the Augmented Histogram is used, but ranked low when the conventional histogram method is used.

3.2 Video Shot Segmentation

We applied our augmented histogram approach to video shot segmentation. Here the goal is to detect the boundaries between shots based on the computed dissimilarity between each successive pair of video frames. A shot boundary is detected when the dissimilarity exceeds a certain threshold. A color video sequence of a news broadcast was used in our experiment. The total number of frames in the sequence is 1,588 and each frame is quantized into 64 colors. Both the direct distance computation method and the block-based method were used in the experiment. The detected shot boundaries are compared to the actual shot boundaries (determined manually) and the results are tabulated in Table 2. The table also contains the results obtained when the conventional color histogram is used. As shown in Table 2, the results obtained using the direct distance computation method and the block-based method are exactly the same. Both methods correctly detected 95.83% of the shot boundaries, whereas the conventional color histogram approach has a detection rate of 87.50%. Two of the shot boundaries missed by the conventional color histogram method are those that involve dissolve transitions. A dissolve transition occurs when one scene gradually disappears while another gradually appears. Using the Augmented Histogram, however, both of these missed boundaries were detected. Figure 5 shows one such dissolve transition. As shown in Figure 5, the color compositions of the video frames before and after the shot boundary are very similar, resulting in similar color histograms. However, when the second shot dissolves in, the spatial distribution changes for some of the colors (note the upper body of a person that begins to appear), resulting in dissimilarity in the computed distance feature values.
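A minimal sketch of the segmentation loop follows, assuming the augmented_histogram() helper from the Section 2.1 sketch; the threshold is a free parameter that the paper does not specify and would be tuned per sequence.

```python
import numpy as np

def detect_shot_boundaries(frames, n_colors, threshold):
    """Declare a shot boundary wherever the dissimilarity between two
    successive frames exceeds a threshold (Section 3.2).

    frames: sequence of 2-D arrays of quantized color indices. For
    simplicity this sketch uses the unweighted sum of the four component
    differences (a = b = c = d = 1) without Gaussian normalization.
    """
    feats = [augmented_histogram(f, n_colors) for f in frames]
    boundaries = []
    for t in range(len(feats) - 1):
        d = sum(np.abs(a - b).sum() for a, b in zip(feats[t], feats[t + 1]))
        if d > threshold:
            boundaries.append(t + 1)      # boundary falls before frame t + 1
    return boundaries
```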

4. CONCLUSIONS AND FUTURE WORK

We have proposed a new image feature called the Augmented Image Histogram. In addition to the intensity or color distribution, the Augmented Image Histogram captures the spatial information of image pixels in terms of their relative distances. For each intensity or color i, the spatial information is compactly represented by three numbers: the mean, variance, and entropy of the pixel-to-pixel distances. These feature components can be computed off-line for image retrieval. We have shown that by using a block-based approach, the computational requirement can be reduced. Experimental results have validated the effectiveness of our approach. When the direct method is used to retrieve color images in a database of 854 images, performance improvements ranging from 16.27% to 26.51% (for scope values between 1 and 20) were obtained over the conventional color histogram. Similar improvements were obtained for the block-based method. When used for shot boundary detection in a color video sequence of 1,588 frames, a significant improvement was also obtained.

For future work, we intend to develop effective methods to obtain optimal weight values for computing the similarity measure in Equation (10). This may require the use of training images. We also intend to investigate other methods for combining the individual feature components in Equation (10). Finally, we plan to study the effect of decreasing the number of blocks in the block-based approach. This will cut down computation time but may lower performance.

5. REFERENCES

[1] M. Flickner et al., "Query by image and video content: The QBIC system," IEEE Computer, September 1995, pp. 23-32.
[2] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," International Journal of Computer Vision, 18(3), 1996, pp. 233-254.
[3] T. S. Huang, S. Mehrotra, and K. Ramchandran, "Multimedia analysis and retrieval system (MARS) project," Proc. of 33rd Annual Clinic on Library Application of Data Processing - Digital Image Access and Retrieval, 1996.
[4] M. J. Swain, "Interactive indexing into image databases," Proc. SPIE: Storage and Retrieval for Image and Video Databases, 1908, Feb. 1993, pp. 95-103.
[5] Y. Gong, H. Zhang, H. Chuant, and M. Sakauchi, "An image database system with content capturing and fast image indexing abilities," Proc. of the Int'l Conf. on Multimedia Computing and Systems, May 1994, pp. 121-130.
[6] M. Stricker and A. Dimai, "Color indexing with weak spatial constraints," Proc. SPIE: Storage and Retrieval for Still Image and Video Databases IV, 2670, Feb. 1996, pp. 29-39.
[7] G. Pass and R. Zabih, "Histogram refinement for content-based image retrieval," IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, December 1996.
[8] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, "Image indexing using color correlograms," IEEE Computer Vision and Pattern Recognition Conference, San Juan, Puerto Rico, June 1997.
[9] F. Idris and S. Panchanathan, "Review of image and video indexing techniques," Journal of Visual Communication and Image Representation, Vol. 8, No. 2, June 1997, pp. 146-166.
[10] M. Ortega, Y. Rui, K. Chakrabarti, S. Mehrotra, and T. Huang, "Supporting similarity queries in MARS," ACM Multimedia '97.
[11] Corel Stock Photo Library, Corel Corporation.
[12] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.


Figure 1. Different pixel patterns illustrating the distinguishing ability of the Augmented Histogram. The computed feature values (C(i), M(i), V(i), E(i)) shown below each image are: (a) (9, 14.00, 35.05, 1.77); (b) (9, 12.78, 100.90, 2.69); (c) (9, 2.00, 0.72, 1.77).

Figure 2. The frequency of occurrence of distance values for the images in Figure 1.

Table 1. Image Retrieval Experimental Results: Precision and Recall for Scope Value = 5

Category      Count   Color Histogram          Direct Method            Block-based Method
                      Precision(%) Recall(%)   Precision(%) Recall(%)   Precision(%) Recall(%)
airshows       91        56.70      3.12          61.54      3.38          61.54      3.38
sailboats      64        39.38      3.08          47.81      3.74          46.25      3.61
sunrises       60        48.00      4.00          71.00      5.92          74.33      6.19
fireworks1     60        73.67      6.14          80.33      6.69          80.33      6.69
fireworks2     32        69.38     10.84          78.75     12.30          77.50     12.11
cheetahs       84        67.86      4.04          75.48      4.49          74.52      4.44
deserts        67        38.21      2.85          51.94      3.88          54.93      4.10
fields         80        31.75      1.98          38.00      2.38          39.00      2.44
mountains      72        36.94      2.57          48.06      3.34          47.50      3.30
brownbears     54        30.37      2.81          35.93      3.33          34.81      3.22
polarbears     24        28.33      5.90          57.50     11.98          53.33     11.11
baldeagles     91        60.44      3.32          68.13      3.74          67.69      3.72
tigers         75        70.40      4.69          79.73      5.32          80.80      5.39

Figure 3. Plot of overall average precision vs. scope, for the conventional Color Histogram, the Direct Method, and the Block-based Method.

Figure 4. Sample query images (left) and the target images (right) with their ranks. (Lower ranks indicate higher similarity.) Example 1: Color Histogram rank 205, Direct Method rank 16, Block-based Method rank 20. Example 2: Color Histogram rank 41, Direct Method rank 1, Block-based Method rank 1.

Table 2. Experimental Results for Video Shot Segmentation

Method               Correctly Detected   Missed Boundaries   False Boundaries
Color Histogram      21 (87.50%)          3 (12.50%)          8 (33.33%)
Direct Method        23 (95.83%)          1 (4.17%)           5 (21%)
Block-based Method   23 (95.83%)          1 (4.17%)           5 (21%)

Total number of shot boundaries = 24
Total number of frames = 1,588

Figure 5. Sample shot boundary detected by our method: Frame 1, Frame 2 (shot boundary), Frame 3. (This boundary is missed when using the conventional color histogram.)