Annotating Personal Albums via Web Mining

Jimin Jia MOE-Microsoft Key Lab of MCC

University of Science and Technology of China Hefei 230027, China

+86-551-3601342

[email protected]

Nenghai Yu MOE-Microsoft Key Lab of MCC

University of Science and Technology of China Hefei 230027, China

+86-551-3600681

[email protected]

Xian-Sheng Hua Microsoft Research Asia

49 Zhichun Road Haidian District

Beijing 100080, China

+86-10-58963200

[email protected]

ABSTRACT
Nowadays personal albums are becoming more and more popular due to the explosive growth of digital image capturing devices. An effective automatic annotation system for personal albums is desired for both efficient browsing and search. Existing research on image annotation has evolved through two stages: learning-based methods and web-based methods. Learning-based methods attempt to learn classifiers or joint probabilities between images and concepts, but they have difficulty handling large-scale concept sets due to the lack of training data. Web-based methods leverage web image data to learn relevant annotations, which greatly expands the scale of concepts. However, they still suffer from two problems: the query image lacks prior knowledge, and the annotations are often noisy and incoherent. To address these issues, we propose a web-based annotation approach that annotates a collection of photos simultaneously, instead of annotating them independently, by leveraging the abundant correlations among the photos. A multi-graph similarity propagation based semi-supervised learning (MGSP-SSL) algorithm is used to suppress the noise in the initial annotations mined from the Web. Experiments on real personal albums show that the proposed approach outperforms existing annotation methods.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval Models.

General Terms

Algorithms, Experimentation

Keywords

Image annotation, personal albums, multi-graph, semi-supervised

learning, similarity propagation

1. INTRODUCTION
Recent years have witnessed a rapid growth in the number of digital personal photo albums, due to the popularity of digital cameras, mobile phone cameras and other portable digital imaging devices. With such a large number of personal photo albums, users often encounter difficulties when browsing an album or searching

for certain photos. Therefore, effective photo album organization and management has emerged as an important topic [25]. Providing meaningful keywords for the albums is an effective way to manage photo albums, as text indexing and ranking techniques can be easily incorporated [33]. With the semantic annotations, it is more convenient for users to browse or search images from their albums. Furthermore, more and more people are willing to upload their own albums onto popular image

sharing communities like Flickr. If the uploaded albums are associated with accurate descriptions, it will significantly improve the experiences of image sharing and search on the Web.

Manual image annotation is accurate, but it heavily relies on human effort and is thus labor-intensive and time-consuming. To reduce this effort, contemporary commercial software products, such as Picasa and ACDSee, allow users to manually select images with the same topic and then provide description text for the series of images. Although the existing software attempts to reduce the tedious labeling work by providing friendly and attractive interfaces, it still requires intensive labor and time. Therefore, effective automatic annotation for personal albums is highly desired.

Generally, automatic image annotation could be categorized into two scenarios: learning-based methods and web-based methods. In learning-based methods, a statistical model is built to learn a classifier or the joint probabilities between images and

annotations. The main problem of this scenario is that the learned models highly depend on the size and distribution of the training dataset. However, a sufficient training dataset is difficult to obtain since it requires high labor costs. For example, the commonly used image annotation dataset, Corel, only contains 371 words, and many popular words like ‘ipod’, ‘mp3’ are not included in the dataset. For an image of any kind, it is difficult to provide abundant and accurate annotations based on such limited-size

concept sets. Therefore, the existing learning-based methods are difficult to extend to large-scale image annotation tasks. On the other hand, web-based annotation methods leverage web-scale data, which can cover an unlimited vocabulary. A typical web-based method is AnnoSearch, proposed by Wang et al. [7]. For a target image, an initial keyword is provided to conduct a text-based search on a web database. Then a content-based


approach is adopted to search visually similar images, and annotations are extracted from descriptions of both visually and semantically similar images. Other methods [6][8][34] further extend Wang et al.’s framework and propose more general image annotation schemes where the initial keywords are not required.

Although web-based methods overcome the insufficiency of training data in learning-based methods, they still suffer from two problems. One is the lack of prior knowledge about the query image, which also exists in learning-based methods. Actually, users bring a great deal of prior knowledge to the interpretation of an image, which is difficult for computers to replicate. The other problem is that the annotations extracted from the descriptions of web images are noisy, so a refinement process is desired.

To address the aforementioned problems, we propose to simultaneously annotate a collection of photos, instead of individual photos, by leveraging a web image database. To overcome the lack of prior knowledge in a single image, we utilize image correlations to facilitate annotating personal photo albums. For example, given Figure 1(a) alone, it is difficult even for a human to tell what the building is. However, given the images in Figure 1(a)-(e), it is much easier to conclude that the building is a palace in the Forbidden City. We argue that a collection of images brings more prior knowledge, and that the image correlations in the album can be utilized to improve image annotation. Moreover, by grouping the images into different clusters, we can build better image queries to learn annotations via searching similar web images.

However, though multiple photos help infer better annotations from the descriptions of web images, the annotations are still noisy. To address this problem, we propose a multi-graph similarity propagation based semi-supervised learning (MGSP-SSL) approach to reduce the noise and improve annotation coherence. Various properties of personal photo albums, such as visual similarity, temporal consistency and word correlation, are employed to mine the correlations in the album, and a modified graph-based semi-supervised learning algorithm is designed accordingly. It is worth highlighting that the inter-relationships between different graphs are learned by a similarity propagation model. By using the modified graph-based semi-supervised learning framework and the similarity propagation model, the annotations for images in the album are cleaned and re-ranked.

Another advantage of the framework is that user interaction can be easily incorporated to help improve the annotations. If users are not satisfied with certain results, they are able to modify them. The system correspondingly produces updated annotations by propagating the users' labels to other images. Generally, users' manual annotations of the selected images also reflect their preferences. As a result, the final annotations are highly related to users' personalized interests.

The contributions of the work are multifold:

1. We propose to mine annotations for personal albums by leveraging a web database. To the best of our knowledge, this is

the first work attempting to automatically annotate a batch of images simultaneously.

2. We exploit image correlations in personal albums to overcome the lack of prior knowledge in existing image annotation methods.

3. A multi-graph similarity propagation based semi-supervised learning (MGSP-SSL) method is used to reduce noise in the initial web-based annotations. The method combines both

annotation propagation and inter-relationships between different graphs to re-rank the initial annotations.

4. User interaction could be easily incorporated during annotation. Annotations for all images in the album will be accordingly re-ranked afterwards.

The organization of the paper is as follows. Section 2 introduces related work. Section 3 presents the annotation framework in detail. The user interaction incorporated in the system is described in Section 4. Experimental results are shown in Section 5. Finally, we conclude the paper in Section 6.

2. RELATED WORK

2.1 Image Annotation for Personal Albums
As aforementioned, existing automatic image annotation methods can be categorized into two classes: learning-based methods and web-based methods. Learning-based methods attempt to learn suitable classifiers or joint probabilities between images and annotations. Typical classifiers include the support vector machine (SVM) [3], Bayes point machines [35], Hidden Markov models [2], etc. Representative works on learning joint probabilities between images and annotations include the Translation Model (TM) [20], the Latent Dirichlet Allocation model (LDA) [5], the Cross-Media Relevance Model (CMRM) [12], the Continuous Relevance Model (CRM) [16], the probabilistic semantic model [31] and the Multiple Bernoulli Relevance Model (MBRM) [4]. However, these methods can only model a limited number of semantic concepts on a small-scale image database. For example, only about 371 words are contained in the commonly used Corel dataset. Such a limited number of concepts is not sufficient for annotating general personal photos. On the other hand, building a sufficiently large, well-annotated training dataset is never a trivial task. Therefore, learning-based methods are difficult to extend to large-scale image annotation tasks.

With the prosperity of web resources, several web-based methods [6][7][8] have been proposed for image annotation that leverage the huge deposit of web resources. These methods learn annotations from the descriptions of web images, which cover an unlimited vocabulary and thus overcome the lack of training data in learning-based methods.

Figure 1: Illustration of a collection of photos, panels (a)-(e).

One pioneering work is AnnoSearch [7]. In this work, at least one accurate keyword is required to perform a text-based search on the Web. Then content-based image search is carried out to find visually similar images, and annotations are mined from the text descriptions of these images. In [6], Wang et al. proposed an alternative content-based model for web image annotation that requires no initial keyword. Rui et al. [8] proposed a bipartite graph model to extract annotations from web resources.

All the above methods can be directly applied to annotate personal photo albums, where each photo is annotated independently. However, important properties of albums, such as image correlations [27][28][32], are not taken into consideration, and the performance is less satisfactory.

With the advance of computer vision, especially in automatic face recognition, some works have been conducted on labeling faces in family albums by leveraging face detection and recognition technologies. Zhang et al. [10] use photo context information to generate a candidate name list for the detected faces. In order to reduce the cost of labeling them one by one, Cui et al. [9] attempt

to cluster all faces first and then manually label these clusters. All these works focus on annotating human faces in family albums and require users to manually label part of the photos as training data, whereas our proposed system avoids this step by mining annotations from web images.

2.2 Graph-Based Semi-Supervised Learning
Semi-supervised learning is a class of machine learning techniques that makes use of both labeled and unlabeled data for training, typically with a small amount of labeled data and a large amount of unlabeled data. Manifold ranking is one of the graph-based semi-supervised learning algorithms; it explores the intrinsic relations among labeled and unlabeled data by constructing a similarity graph. Manifold ranking [11] has proved effective in many applications such as image retrieval [17] and image annotation [19]. In this approach, the to-be-annotated keywords (or semantic concepts) are propagated through the similarity graph so that unlabeled images can be associated with proper keywords. Since images in personal albums are correlated, image correlations can be incorporated into the manifold ranking framework to mine more accurate annotations. Important properties of personal albums, such as temporal consistency, can be employed to construct different graphs, and multi-graph propagation can be applied to refine the annotations.

2.3 Multi-Model Fusion
Recently, some works [21][22] have focused on mining the inter-relationships among heterogeneous data types in text and image retrieval. Kandola et al. [21] proposed to leverage the relationship between terms and documents to exploit semantic similarity, and Wang et al. [22] proposed to mine the relationships between different types of objects by similarity propagation. In this paper, we adopt the similarity propagation model of [22] in image annotation to build inter-relationships between images and keywords. The integrated correlation graph can then be used in the semi-supervised learning framework.

3. OUR APPROACH
In this section, we present the proposed framework in detail. First an overview of the whole system is introduced, and then we describe the method of learning annotations for image clusters from a web image database. Afterwards, the multi-graph similarity propagation based semi-supervised learning (MGSP-SSL) approach, which learns the final annotations of the photo album, is detailed.

3.1 Overview
Figure 2 illustrates the basic framework of our approach. Album descriptions, which could be the folder names of albums or keywords manually provided by users, are first obtained from the photo albums. Different from AnnoSearch [7], which requires an initial keyword for every individual image, our approach only requires one or more keywords for the entire album. The description keywords are submitted to the web dataset to search for keyword-related images, which are used to learn annotations for candidate images. We select Flickr as the web image database since it contains large-scale images associated with user-labeled tags. As photo albums often contain redundant images and overlapping scenes, image clustering is first conducted on the albums to group similar images into clusters. For each image cluster, better queries are built by leveraging image correlations, and then a content-based search is carried out to find visually similar web images for each new query.

Afterwards, annotations for image clusters are learned from the tags of both visually and semantically similar web images. To make the initial annotations less noisy and more coherent, a learning framework that incorporates both the similarity propagation model and semi-supervised learning is built, leveraging multiple correlation graphs such as the visual correlation graph, the temporal correlation graph and the word correlation graph. The final annotations are subsequently produced by the MGSP-SSL framework.

Figure 2: Framework of the proposed annotation system (album descriptions and album images → web image database → image clusters with per-cluster annotations → graph-based SSL over multiple correlation graphs → MGSP-SSL → final annotations).

3.2 Album Clustering
All images in personal albums are first represented in the form of the visual words model [1], analogous to the vector space model in text retrieval. Then k-means clustering is carried out on all images in the album.

3.2.1 Image Representation
Both album images and web images are represented in the visual words model in the following manner. For each image, a set of region descriptors is extracted, which is able to deal with changes in viewpoint, illumination and partial occlusion. All the detected affine-invariant regions are then represented by 128-dimensional SIFT descriptors developed by Lowe [23]. SIFT descriptors have proved superior to others, such as the responses of a set of steerable or orthogonal filters. Compared with the global features commonly used in [7][8], SIFT descriptors are much more suitable for matching objects across album images, since overlapping scenes or objects often exist in different images [24].

All SIFT descriptors are then grouped into clusters called visual words by vector quantization. The vector quantization is carried out by k-means clustering. It is worth noting that k-means can be

replaced by other state-of-the-art clustering methods as well. In our experiments, 2000 visual words are used for all images. After vector quantization, all images are represented as 2000-dimensional vectors.
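As a rough illustration of this representation step, the sketch below builds a visual-word vocabulary by k-means over pooled SIFT descriptors and quantizes each image into a 2000-dimensional count vector. The function names and the use of scikit-learn are our assumptions; the paper does not prescribe a particular implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors_per_image, n_words=2000, seed=0):
    """Cluster all 128-D SIFT descriptors into n_words visual words."""
    all_desc = np.vstack(descriptors_per_image)         # (total_descriptors, 128)
    return KMeans(n_clusters=n_words, random_state=seed).fit(all_desc)

def quantize(descriptors, vocabulary):
    """Map one image's descriptors to a visual-word count vector."""
    words = vocabulary.predict(descriptors)             # nearest visual word per descriptor
    return np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
```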

3.2.2 Building Image Clusters
In the visual words model, each image i in the album is represented as a weighted vector of visual words $\{t_{i1}, t_{i2}, \ldots, t_{iN}\}$, where $t_{ij}$ is the weight of the j-th visual word in the i-th image. Similar to weighted vectors in text retrieval, each visual word in an image is weighted by 'term frequency-inverse document frequency' (tf-idf):

$$t_{ik} = \frac{n_{ik}}{n_i}\,\log\frac{N}{n_k} \qquad (1)$$

where $n_{ik}$ is the number of occurrences of visual word k in image i, $n_i$ is the total number of words in image i, $n_k$ is the number of occurrences of word k in the whole database, and N is the number of documents in the whole database. The weight is the product of two terms: the word frequency $n_{ik}/n_i$ and the inverse document frequency $\log(N/n_k)$.

The visual similarity between two images is computed as the cosine similarity between their image vectors. Then k-means clustering is carried out on all images to group similar ones.

In some object retrieval works [13], matching objects between different images is conducted in a more sophisticated way, where

the key points in one image are compared to all key points in the other image via proper transformation so that the same objects in different images could be matched more accurately. However, it is

time-consuming to match all key points between two images, especially when the album is large. Considering the efficiency of the system, we directly compute the distance between visual word vectors. If the size of the visual vocabulary is properly set, the clustering performance is still satisfactory.

Since there is no common standard for setting the number of image clusters, we set it heuristically. The number of image clusters is set to be proportional to the size of the album, based on the assumption that the more images an album contains, the more topics it covers.
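Continuing the sketch above, the tf-idf weighting of Eq. (1), the cosine similarity and the k-means clustering of album images might be implemented as below. The `images_per_cluster` heuristic is an assumption, since the paper only states that the cluster number is set proportionally to the album size.

```python
import numpy as np
from sklearn.cluster import KMeans

def tfidf_weight(counts):
    """counts: (n_images, n_words) raw visual-word counts -> tf-idf weights as in Eq. (1)."""
    n_images = counts.shape[0]
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)   # n_ik / n_i
    n_k = np.maximum(counts.sum(axis=0), 1)                          # occurrences of word k in the database
    idf = np.log(n_images / n_k)                                     # log(N / n_k)
    return tf * idf

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between image vectors (rows of X)."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return Xn @ Xn.T

def cluster_album(tfidf_vectors, images_per_cluster=5, seed=0):
    """K-means over the album; the cluster count grows with the album size (heuristic)."""
    n_clusters = max(1, len(tfidf_vectors) // images_per_cluster)
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(tfidf_vectors).labels_
```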

3.3 Annotations for Image Clusters
Initial annotations for image clusters are mined from the tags of semantically similar web images. Since keywords may be ambiguous (e.g., both 'apple computer' and 'apple fruit' are relevant to the query 'apple'), content-based retrieval is subsequently applied to the web images to ensure visual similarity with the images in the albums. Then annotations for the image clusters are learned from the tags of both visually and semantically similar web images. Since images in the same cluster are correlated, we build a new query vector Q to represent the cluster. The query vector Q is simply formed by averaging all images in an image cluster:

$$Q = \frac{1}{N_c}\sum_{i \in c} q_i \qquad (2)$$

where $q_i$ is the i-th image in cluster c and $N_c$ is the number of images in c. Here $q_i$ is represented as a visual-word vector. Therefore, Q is constructed by combining the frequencies of all visual words in the cluster. Then Q is submitted to search for visually similar web images, and annotations are learned from the tags of the visually similar image set $I^{*}$. The top K words are selected as annotations for the image cluster.

The top K words are ranked by incorporating both visual content and word frequencies. Intuitively, visually similar images are more likely to share annotations. Furthermore, the more times a keyword appears in the tags of image set $I^{*}$, the more likely it is to be an annotation keyword. Therefore, the final rank $R(w)$ for a candidate word $w$ is calculated as:

$$R(w) = \sum_{j \in I^{*}} \frac{S(Q, j)\,\delta_j(w)}{N_j} \qquad (3)$$

where $S(Q, j)$ is the low-level feature similarity between the query vector Q and web image j, $N_j$ is the number of tags of image j, and $\delta_j(w)$ is the 0-1 indicator of whether $w$ exists in the tags of image j:

$$\delta_j(w) = \begin{cases} 1, & w \text{ exists in the tags of image } j \\ 0, & w \text{ does not exist in the tags of image } j \end{cases} \qquad (4)$$

In Eq. (3), $S(Q, j)$ is the visual content similarity part, and the sum reflects the word frequency part. For web image j, the score for every tag $w$ is distributed uniformly among all its tags, i.e., $S(Q, j)/N_j$. The final rank $R(w)$ is then calculated by summing the scores from all images tagged with $w$.
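A minimal sketch of Eqs. (2)-(4): the cluster query Q is the average of the member images' visual-word vectors, and each candidate tag accumulates the visual similarity S(Q, j) spread uniformly over the N_j tags of each retrieved web image. The input format (a list of (vector, tags) pairs for the retrieved images) is our assumption.

```python
import numpy as np
from collections import defaultdict

def cluster_query(cluster_vectors):
    """Eq. (2): average of the visual-word vectors of all images in the cluster."""
    return np.mean(cluster_vectors, axis=0)

def rank_candidate_words(Q, retrieved, top_k=10):
    """Eqs. (3)-(4): retrieved is a list of (vector, tags) pairs for visually similar web images."""
    scores = defaultdict(float)
    for vec, tags in retrieved:
        if not tags:
            continue
        sim = float(Q @ vec) / (np.linalg.norm(Q) * np.linalg.norm(vec) + 1e-12)  # S(Q, j)
        for w in tags:                            # spread S(Q, j) uniformly over the N_j tags
            scores[w] += sim / len(tags)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```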

An interesting observation is that the new query vector Q is similar to the new query built from relevance feedback in image

retrieval. Note that in image retrieval, image search engine returns both positive and negative images for the query. The positive images are combined with the original query image to form a new


query to obtain better performance. Here the images in a cluster could also be viewed as positive images in relevance feedback. Since relevance feedback could bring improvement for image

retrieval, it is expected that annotating image clusters could also improve image annotation performance.

3.4 Multiple Graph Similarity Propagation Semi-Supervised Learning
In an ideal case, if the images in a cluster are highly correlated and the annotations are accurately obtained, the initial annotations for the image clusters can be considered good enough as annotations for every single image in the cluster. However, the performance of image clustering is not guaranteed, and the annotations learned from web image tags are often noisy and incoherent. In order to obtain more accurate annotations, the initial annotations need to be refined. We propose a multi-graph similarity propagation based semi-supervised learning (MGSP-SSL) model to suppress the annotation noise. Different correlation graphs in personal albums are integrated into a modified manifold ranking framework. In addition, the inter-relationships between different graphs are mined by the similarity propagation model.

3.4.1 Graph-based SSL for Image Annotation
Manifold ranking [11] is a typical graph-based semi-supervised learning framework. Labels are estimated and ranked by similarity propagation among all the data (labeled and unlabeled) on the graph. We first give a general review of the manifold ranking algorithm in image annotation.

Suppose a personal photo album contains m images $I = \{I_1, I_2, \ldots, I_m\}$ and an annotation keyword collection $w = \{w_1, w_2, \ldots, w_n\}$. The initial annotations of the photo album are represented by an $m \times n$ matrix Y whose element $Y(i, j)$ denotes the probability of image $I_i$ being tagged with keyword $w_j$. Every image is initially tagged with the annotations of the image cluster it belongs to. Denote by S the visual similarity matrix among all images in the photo album. We symmetrically normalize S by $S' = D^{-1/2} S D^{-1/2}$, where D is the diagonal matrix whose entry $D(i, i)$ equals the sum of the i-th row of S. For simplicity, the normalized matrix $S'$ is still denoted by S. F is an $m \times n$ matrix whose elements indicate the final scores of the keyword collection w for all images in the album. In manifold ranking, F is iteratively computed as follows:

$$F^{(t+1)} = \alpha S F^{(t)} + (1-\alpha) Y \qquad (5)$$

where $\alpha$ is a parameter controlling the weight of the contribution of the visual similarity. The convergence of $F^{(t)}$ has been proved, and the steady state is:

$$F^{*} = (I - \alpha S)^{-1} Y \qquad (6)$$

3.4.2 Modified Graph-based SSL
In Eq. (5), only the visual correlation graph is considered for image similarity, while other correlation graphs, such as the temporal correlation graph, are neglected. Temporal correlations [26][29][30] are an important and intrinsic property of personal albums, and timestamps can easily be obtained from digital capturing devices. We aim to incorporate temporal correlation into the manifold ranking algorithm. Intuitively, if two photos are taken within a very short interval, they share similar concepts or scenes with high probability. Even if they are taken casually and have totally different scenes, these scenes are still, to some extent, relevant to similar topics (e.g., the circumstances of the photographer). Denote by T the temporal correlation matrix among all images. The temporal correlation is fused with the visual correlation in a linear manner, and the modified manifold ranking framework is represented as:

$$F^{(t+1)} = \alpha[\lambda S + (1-\lambda)T] F^{(t)} + (1-\alpha) Y \qquad (7)$$

where $\lambda$ is the coefficient modulating the two correlation graphs. It can easily be proved that Eq. (7) still converges, and the steady-state F is:

$$F^{*} = [I - \alpha(\lambda S + (1-\lambda)T)]^{-1} Y \qquad (8)$$
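A sketch of the closed-form ranking of Eq. (8), assuming S and T have already been symmetrically normalized and Y holds the initial cluster-level scores; the values of alpha and lam are illustrative, as the paper does not report the settings used.

```python
import numpy as np

def normalize_graph(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} used for the similarity matrices."""
    d = np.maximum(A.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def modified_manifold_ranking(S, T, Y, alpha=0.8, lam=0.7):
    """Eq. (8): F = [I - alpha*(lam*S + (1-lam)*T)]^{-1} Y."""
    m = S.shape[0]
    G = lam * S + (1.0 - lam) * T                 # linear fusion of visual and temporal graphs
    return np.linalg.solve(np.eye(m) - alpha * G, Y)
```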

3.4.3 Multi-Graph Similarity Propagation
Besides visual and temporal correlations, word correlations are also valuable for reducing annotation noise and obtaining more coherent annotations. However, annotation keywords are assumed to be independent in Eq. (5). Actually, annotation keywords have strong correlations, and correlated words are more likely to co-exist in the final annotations. For example, 'BMW' and 'car' are semantically correlated. Accordingly, if an image is annotated as 'BMW', it has a high probability of being labeled as 'car', even if 'car' is ranked low in the initial annotations.

It is not convenient to directly integrate word correlations into the manifold ranking framework. Here, word correlations are fused with visual correlations by the similarity propagation model [22]. The similarity propagation model is used to mine the inter-relationships between different types of correlation graphs. In Eq. (7), the visual correlation graph S is pre-computed from the visual similarities of images and stays fixed during the iteration. Actually, it has been shown [21] that heterogeneous graphs are inter-related and can mutually affect each other. The influence of word correlations on the final annotations can be exerted through the visual correlations via similarity propagation between the two graphs.

Figure 3 shows an example where visual similarity and word correlation are inter-related. In Figure 3, every image is associated with one keyword (represented by a solid line between the image and the word). Figures 3 (a) and (b) are not visually similar. However, the annotations of the two images, 'car' and 'traffic', are semantically correlated (indicated by bold lines). The visual similarity of the two images should then be correspondingly enhanced (indicated by dashed lines). The enhanced visual similarity makes Figure 3 (b) more likely to be annotated by 'traffic' according to Eq. (7). On the other hand, word correlations are generally computed from WordNet or the co-occurrence of two words in documents (see Section 3.4.5). However, visual correlations also have an impact on word correlations. Figures 3 (b) and (c) are visually similar. Therefore, the word correlation between 'car' and 'luxury' should

be correspondingly reinforced.

Figure 3: Illustration of similarity propagation (three example images (a)-(c) tagged with 'car', 'traffic' and 'luxury').

Denote by S, T, W and F the visual correlation graph, the temporal correlation graph, the word correlation graph, and the relationship matrix between images and

keywords, respectively. Denote by $S^{(t)}$ and $W^{(t)}$ the corresponding matrices after the t-th iteration. As in [22], the similarity propagation model is represented as:

$$S^{(t+1)} = \beta S^{(0)} + (1-\beta)\, F W^{(t)} F^{T}$$
$$W^{(t+1)} = \gamma W^{(0)} + (1-\gamma)\, F^{T} S^{(t+1)} F \qquad (9)$$

where $S^{(0)}$ and $W^{(0)}$ are the initial similarity matrices, $F^{T}$ is the transpose of F, and $\beta$ and $\gamma$ are the coefficients of the linear fusion, which are often set empirically. From the equations, an updated correlation graph, e.g. the visual correlation graph $S^{(t+1)}$, combines both the original correlation graph $S^{(0)}$ and the influence of the word correlation graph $W^{(t)}$, which contributes to $S^{(t+1)}$ through the image-keyword relationship matrix F. The convergence of the update process has been proved in [22], and the steady states of S and W are:

$$S = [\beta S^{(0)} + (1-\beta) F W^{(0)} F^{T}]\,[I - (1-\beta)(1-\gamma) F F^{T}]^{-1}$$
$$W = [\gamma W^{(0)} + (1-\gamma) F^{T} S^{(0)} F]\,[I - (1-\beta)(1-\gamma) F^{T} F]^{-1} \qquad (10)$$

The updated correlation graph S after similarity propagation in Eq.(9) reflects the influence of both visual correlation and word correlation. The problem of the lack of word correlation in Eq. (7) could be avoided in this way, and the steady-state correlation graph S could now be used in Eq. (8) to refine the initial annotations. At this stage, the semi-supervised learning method

simultaneously integrates three correlation graphs: the visual correlation graph, the word correlation graph and the temporal correlation graph.

3.4.4 MGSP-SSL Steps
In this section, we list the MGSP-SSL computation steps for refining the initial annotations. After extracting the initial annotations for the image clusters, the final annotations are obtained by the following steps:

1) Assign each image the keywords of its cluster as its initial annotations.
2) Build the annotation matrix F according to the initial annotations in step 1.
3) Build the word correlation graph $W^{(0)}$, the visual correlation graph $S^{(0)}$ and the temporal correlation graph T.
4) Compute the updated correlation graph S by similarity propagation in Eq. (10).
5) Compute the final annotations for each image with the manifold ranking of Eq. (8), using the steady-state S.
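A compact sketch of steps 2-5 under Eqs. (8) and (10); the matrix shapes, parameter values and the omission of any re-normalization of the propagated graph are simplifying assumptions.

```python
import numpy as np

def propagate_similarity(S0, W0, F, beta=0.7, gamma=0.7):
    """Steady-state visual graph after propagation with the word graph W0 (Eq. 10)."""
    m = S0.shape[0]
    left = beta * S0 + (1.0 - beta) * F @ W0 @ F.T
    right = np.eye(m) - (1.0 - beta) * (1.0 - gamma) * F @ F.T
    return left @ np.linalg.inv(right)           # assumes `right` is well conditioned

def mgsp_ssl(S0, W0, T, Y, alpha=0.8, lam=0.7, beta=0.7, gamma=0.7):
    """Steps 2-5: F from the initial annotations, similarity propagation, then re-ranking."""
    F0 = Y.copy()                                 # step 2: initial image-keyword matrix
    S = propagate_similarity(S0, W0, F0, beta, gamma)   # step 4
    G = lam * S + (1.0 - lam) * T                 # fuse the propagated visual graph with the temporal graph
    m = S0.shape[0]
    return np.linalg.solve(np.eye(m) - alpha * G, Y)    # step 5: Eq. (8)
```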

3.4.5 Graph Construction
Visual Correlation Graph
With the visual word model for images, the visual similarity of images can be measured analogously to that of document vectors. Since each image is represented by a visual word vector, the cosine similarity is adopted here for the visual similarity between images i and j, i.e., $S(i, j) = d_i \cdot d_j / (\|d_i\| \|d_j\|)$. It is noteworthy that other similarity measures could be applied here as well; since our system is a general framework, different measures can be conveniently employed.

Temporal Correlation Graph
There are several methods for modeling temporal consistency. Wang et al. [14] consider the temporal relationship between two adjacent units, and a Gaussian function is proposed in [15] to model temporal consistency among all frames. Here we assume that the temporal correlation can be neglected if two images are taken a long interval apart. Therefore we set a threshold as in Eq. (11): if the interval between two images is larger than the threshold $\varepsilon$, the temporal correlation is set to 0; otherwise, it is modeled in a Gaussian manner as in [15]. The temporal correlation graph is represented as:

$$T(i, j) = \begin{cases} \exp\left\{-\dfrac{\|t_i - t_j\|^2}{2\sigma^2}\right\}, & |t_i - t_j| \le \varepsilon \\ 0, & |t_i - t_j| > \varepsilon \end{cases} \qquad (11)$$

where $t_i$ and $t_j$ are the timestamps of images i and j.
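The thresholded Gaussian of Eq. (11) might be computed as follows, with timestamps in seconds; the values of sigma and eps are illustrative, since the paper does not report them.

```python
import numpy as np

def temporal_graph(timestamps, sigma=1800.0, eps=7200.0):
    """Eq. (11): Gaussian temporal affinity, zeroed beyond the threshold eps (all in seconds)."""
    t = np.asarray(timestamps, dtype=float)
    diff = np.abs(t[:, None] - t[None, :])
    T = np.exp(-diff ** 2 / (2.0 * sigma ** 2))
    T[diff > eps] = 0.0
    return T
```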

Word Correlation Graph
The word correlations define the semantic distance between two words. The normalized Google distance (NGD) [18] was proposed to obtain word correlations from Google in a statistical way. Considering that the annotations here are mined from Flickr tags, we define a normalized Flickr distance (NFD), similar to NGD, between two words $w_i$ and $w_j$:

$$NFD(w_i, w_j) = \frac{\max\{\log f(w_i), \log f(w_j)\} - \log f(w_i, w_j)}{\log G - \min\{\log f(w_i), \log f(w_j)\}} \qquad (12)$$

where G is the total number of images on Flickr, $f(w_i)$ is the number of images tagged with $w_i$, and $f(w_i, w_j)$ is the number of images tagged with both $w_i$ and $w_j$.

The smaller the value of NFD, the greater the correlation between the two words. A general exponential function [34] is then used to compute the word correlations:

$$W(w_i, w_j) = \exp(-NFD(w_i, w_j)) \qquad (13)$$

Therefore, a greater $W(w_i, w_j)$ indicates greater relevance between the two words.
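A sketch of Eqs. (12)-(13) from tag-frequency statistics; how the frequencies f(w_i), f(w_j) and f(w_i, w_j) are obtained from the crawled Flickr data is not specified in the paper, so they are simply passed in as numbers here.

```python
import math

def nfd(f_i, f_j, f_ij, G):
    """Eq. (12): normalized Flickr distance; G is the total number of images on Flickr."""
    if f_i == 0 or f_j == 0 or f_ij == 0:
        return float('inf')                      # never co-tagged words end up with zero correlation
    num = max(math.log(f_i), math.log(f_j)) - math.log(f_ij)
    den = math.log(G) - min(math.log(f_i), math.log(f_j))
    return num / den

def word_correlation(f_i, f_j, f_ij, G):
    """Eq. (13): W(w_i, w_j) = exp(-NFD(w_i, w_j))."""
    return math.exp(-nfd(f_i, f_j, f_ij, G))
```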

4. USER INTERACTION
Due to the imperfection of image clustering and the noisy annotations, the final annotations may not meet users' expectations. In our system, user interaction can be easily incorporated at any stage and has a direct impact on the final annotations.

User Interaction in Image Clustering

At the image clustering stage, if users are not satisfied with the image clustering results, they are able to adjust the image clusters by manually removing images from clusters. Besides, they can merge different images into one cluster or split one image cluster into multiple clusters. The system builds a new query vector for each image cluster according to the users' operations, and the annotations for the new image clusters are then re-ranked and updated. Introducing user interaction at the image clustering stage has a direct influence on the results, since all the following steps are implemented on the basis of the initial annotations.

User Interaction in Image Annotation

When browsing albums, users are allowed to randomly select images and manually provide description words for them. We


utilize these user-labeled keywords to gradually improve the annotation performance. Images with user-labeled tags will influence other correlated images in the album, and the overall performance will be improved. More specifically, F in Eq. (7) is changed during user interaction, and the semi-supervised learning algorithm is then re-computed until the new F converges. Therefore, the annotations for all images are re-ranked with the help of the user-labeled images.
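As a rough sketch of how user labels could be folded back in, the user-provided keywords might be written into the corresponding entries of Y (here with score 1) before re-running the ranking of Eq. (8); this particular clamping scheme is our assumption rather than a detail given in the paper.

```python
import numpy as np

def apply_user_labels(Y, user_labels, score=1.0):
    """user_labels: {image_index: [keyword_index, ...]} collected from the interface."""
    Y = Y.copy()
    for img, keywords in user_labels.items():
        for k in keywords:
            Y[img, k] = score                    # clamp the user-confirmed keywords
    return Y

# Re-running the ranking of Eq. (8) with the updated matrix, e.g.
#   F = np.linalg.solve(np.eye(m) - alpha * G, apply_user_labels(Y, user_labels)),
# propagates the user-provided labels to the other correlated images in the album.
```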

5. EXPERIMENTS

5.1 Dataset
We crawled 20 different personal albums from Flickr as the experimental dataset. These albums contain various kinds of photos captured while the photographers were travelling at different places around the world, such as the Great Wall in Beijing and the Golden Bridge in Los Angeles. All albums are provided with several keywords to perform the text-based search on the Web. These keywords are mainly about the travel destinations described in the album.

Figure 4 shows part of the images from a typical personal album, which describes a trip to the Summer Palace in Beijing. We will take this album as the example to illustrate our experiments in the following sections. The album covers various topics of photos such as landscapes and portraits. There are 83 images in the album, covering an 8-hour interval. It can be seen that, although the photos are taken casually, there are overlapping scenes or objects in different photos, such as Kunming Lake, the Foxiang Palace, bridges and boats. In particular, some photos are near duplicates. By utilizing these overlapping scenes or objects, image correlations can be exploited, and no extra effort is needed for annotating redundant images. Although users are often expected to provide tags for their uploaded photos on Flickr, they are actually not very active in labeling all these uploaded images. For this album, the user only provides several keywords to describe the whole album, such as 'Peking' and 'Summer Palace'. Based on this limited number of words, it is not very convenient to browse the album or search for certain images. From this point of view, it highlights the significance of automatically annotating personal albums. 'Summer Palace' is taken as the description keyword and submitted to Flickr to search for semantically similar images. 5083 images are downloaded, with 4055 keywords in all. Annotations will be learned from the tags of the downloaded web images.

5.2 Evaluation
Image annotation performance is evaluated by comparing the annotation results with the ground-truth labels, and the performance is averaged over all test albums. Since some albums are associated with no user-labeled tags or with meaningless tags, ground-truth tags are manually labeled by volunteers who have little knowledge of image annotation. The performance is then evaluated by the average precision (AP), which measures the average accuracy of the top N words in the final annotations over all test images. For the i-th album, the average precision is calculated as:

$$AP@N(i) = \frac{1}{M_i}\sum_{j \in C_i} \frac{accuracy(i, j, N)}{N} \qquad (14)$$

where $M_i$ is the number of images in the i-th album, $accuracy(i, j, N)$ is the number of correct words among the top N annotations of the j-th image in the i-th album, and $C_i$ is the image collection of the i-th album.

The overall average precision is then obtained by averaging over the AP of all test albums, where C is the number of test albums:

$$AP@N = \frac{1}{C}\sum_{i=1}^{C} AP@N(i) \qquad (15)$$

Since annotations are mined from tags of semantically similar web images which definitely contain the description keywords submitted to Flickr, we eliminate these description keywords from the final annotations to make the evaluation more reasonable.
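The evaluation of Eqs. (14)-(15) could be implemented roughly as below, where `predicted[i][j]` is the ranked annotation list for the j-th image of the i-th album and `ground_truth[i][j]` is its manually labeled tag set; these container names are assumptions.

```python
def ap_at_n(predicted, ground_truth, N):
    """Eqs. (14)-(15): average precision of the top-N annotations, averaged over albums."""
    album_scores = []
    for album_pred, album_gt in zip(predicted, ground_truth):
        per_image = [sum(1 for w in pred[:N] if w in gt) / N        # accuracy(i, j, N) / N
                     for pred, gt in zip(album_pred, album_gt)]
        album_scores.append(sum(per_image) / len(per_image))        # AP@N(i), Eq. (14)
    return sum(album_scores) / len(album_scores)                    # overall AP@N, Eq. (15)
```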

5.3 Experimental Results

5.3.1 Performance of Image Clustering
Figure 5 shows part of the image clustering results on the personal albums. Since the images cover various topics and scenes, we empirically set a relatively large cluster number (20 in this example). It can be observed that the images in some clusters are quite coherent. For example, the towers in Figure 5 (a) are photographed from different viewpoints, but they can still be grouped into the same image cluster. This shows that SIFT descriptors are suitable for grouping images in personal albums, which often contain overlapping scenes or objects seen from different viewpoints. Figure 5 (b) includes four near-duplicate images, which would require redundant effort if annotated independently. Some image clusters are not correctly grouped, such as Figure 5 (c)

and (d), which should be split into several clusters, while some of their images should be merged into other clusters.

Figure 4: An example of a personal album.

Figure 5 also shows the initial annotations learned from the web image tags. For each image cluster, the top 10 visually similar web images are selected as the candidate image set, whose tags are ranked to obtain the annotations. We find that the results contain abundant annotations, including cross-language keywords. The reason is that web-based methods mine annotations from web image tags, which cover an unlimited-size concept set. Some of the annotations for Figure 5 (a) and (b) are acceptable and coherent, while the annotations for (c) and (d) are relatively diverse because of the poor performance of image clustering.

5.3.2 Performance of Image Annotation
We compare the performance of four different annotation methods: the baseline, Anno Collection, Manifold Ranking and our proposed MGSP-SSL.

The baseline method applies a web-based annotation method similar to AnnoSearch [7], which annotates all images in the album independently. The album descriptions serve as queries to search for semantically related images for every image in the album. Then a content-based method is conducted to search for visually similar images among the keyword-related images. Finally, the annotations are directly learned from the tags of the visually similar images by Eq. (3), instead of using SRC as in AnnoSearch.

The Anno Collection method first groups images into multiple clusters and builds a new query vector for each cluster. Then the annotations of the image clusters are learned in a similar way to the baseline method. Afterwards, each image is annotated with the annotations of the image cluster it belongs to.

The Manifold Ranking method incorporates only the visual similarity graph to re-rank the initial annotations, i.e., the similarity propagation model is neglected. The top N annotations are then selected as the final annotations.

Figure 6 shows the average precision of the top 7 words. From the figure it is obvious that the proposed MGSP-SSL method achieves the best overall performance among the four methods. The relative improvements of MGSP-SSL over the baseline, Anno Collection and Manifold Ranking methods are 9.3%, 6% and 4.7% respectively for the top 1 word. The overall performance ranking is MGSP-SSL > Manifold Ranking > Anno Collection > baseline. This indicates that annotating a collection of correlated images produces better annotation results than annotating each image individually. By propagating the annotations among images in the album, more accurate annotation results can be obtained. Besides, the superior performance of MGSP-SSL over Manifold Ranking indicates that the performance can be improved by fusing visual similarities and word correlations before semi-supervised learning, which proves the effectiveness of the similarity propagation model. In Figure 5, a 'lake' image in cluster (d) is similar to the images in cluster (b); thus, the annotations of cluster (b) have a higher probability of being propagated to the 'lake' image in (d). Although similarity propagation makes the annotations worse for some images, the overall performance is still improved.

5.3.3 User Interaction Study
Users may not be satisfied with the results provided by the system. They are able to adjust the results for better performance at any stage of image annotation. In this section, we explore the performance of the annotation system when user interaction is introduced at different stages.

5.3.3.1 User Interaction in Image Clustering
From Figure 5, we find that the performance of image clustering is not very satisfying. For image clusters with diverse images, users are asked to manually reorganize these image clusters to make images in the same cluster more coherent. Then new query vectors are built on the new image clusters for learning annotations from web image tags.

Figure 5: Example of image clusters and annotations (panels (a)-(d) show clusters with their top-ranked tags). The annotation words include cross-language (English and Chinese) words. The translations of the Chinese words are: 颐和园 - Summer Palace; 中国 - China; 中國 - China; 花 - flower.

Figure 6: Performance comparison of different annotation methods (AP of the top N words, N = 1 to 7, for the Baseline, Anno Collection, Manifold Ranking and MGSP-SSL methods).

The purpose of this experiment is two-fold: one is to test the annotation performance when user interaction is incorporated and the other is to test how much performance improvement can be achieved with better image clustering results.

Three methods are compared in this scenario: baseline, Anno Collection, and Anno Collection with user interaction (UI).

The average precision values are illustrated in Figure 7. From the figure we can see that, after user interaction, the performance is significantly improved compared with both the baseline and Anno Collection. This indicates that user interaction has a great influence on the performance of image annotation. Besides, better image clusters bring improvements to the final results. In other words, more coherent images in the clusters provide better queries to search for more similar web images, which subsequently produces annotations with less noise. Conversely, if all photos in an album are totally irrelevant to each other, building image clusters will produce noisy query vectors and result in worse annotations.

Figure 7: Performance comparison after user interaction in image clustering (AP of the top N words, N = 1 to 7, for Baseline, Anno Collection, and Anno Collection + UI).

5.3.3.2 User Interaction in Image Annotation
Users are able to manually provide annotations for images when browsing the album. In our experiments, users are asked to manually give annotations for 10 randomly selected images. Then three annotation methods are compared: the baseline, the baseline with user interaction (+UI), and MGSP-SSL with user interaction (+UI). Table 1 shows the performance comparison of the three methods. From the table, the baseline with user interaction brings less improvement for the whole system, as it only corrects the annotations of the human-labeled images. However, MGSP-SSL with user interaction brings more improvement because it not only corrects those manually labeled images but also propagates the accurate annotations to other correlated images.

Table 1: Performance comparison after user interaction in manually providing annotations

Top N   Baseline   Baseline + UI     MGSP-SSL + UI
1       0.229      0.238 (+3.96%)    0.264 (+15.4%)
2       0.210      0.219 (+4.68%)    0.251 (+19.7%)
3       0.196      0.206 (+5.31%)    0.241 (+23.4%)
4       0.188      0.2016 (+7.25%)   0.226 (+20.47%)
5       0.181      0.1968 (+8.78%)   0.215 (+18.9%)

6. CONCLUSIONS AND FUTURE WORK
In this paper, we have proposed a web-based method to annotate personal photo albums. Different from existing annotation methods, photo correlations in the album are leveraged to generate more accurate annotations. Initial annotations are obtained for image clusters instead of individual images. A multi-graph similarity propagation based semi-supervised model is proposed to refine the noisy annotations for all images in the album. In addition, user interaction can be easily incorporated at any stage to help improve the overall performance. Experiments on real personal photo albums have demonstrated better performance than the existing methods.

In the future, we will continue our work in two directions. First, we will study how the performance changes as the number of images in the album increases, and evaluate the system on a large number of general personal photo albums. Second, we will attempt to apply the system to more realistic applications, such as image summarization, personalized image annotation and efficient image search and browsing.

7. ACKNOWLEDGEMENTS
We would like to thank the reviewers for their valuable and constructive comments. The research is supported in part by

National Natural Science Foundation of China (60672056) and Specialized Research Fund for the Doctoral Program of Higher Education (20070358040).

8. REFERENCES
[1] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in video. In Proceedings of the International Conference on Computer Vision (ICCV), 2003.

[2] J. Li and J.Z. Wang. Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2003, 25(9): 1075-1088.

[3] C. Cusano, G. Ciocca, and R. Schettini. Image Annotation Using SVM. Proceedings of Internet Imaging, Vol. SPIE 5304. 2004.

[4] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple bernoulli relevance models for image and video annotation. In Proc. of CVPR, Washington, DC, June, 2004.

[5] D. M. Blei and M. I. Jordan. Modeling annotated data. In Proc. SIGIR, Toronto, July. 2003.

[6] C. Wang, F. Jing, L. Zhang and H.J. Zhang. Scalable search-based image annotation of personal images. In Proceedings of the 8th ACM international workshop on Multimedia information retrieval. ACM Press New York, NY, USA, 2006, 269—278

[7] X. Wang, L. Zhang, F. Jing and W.Y. Ma. AnnoSearch: Image Auto-Annotation by Search. International Conference on Computer Vision and Pattern Recognition, Washington, DC, 2006, 1483-1490.

[8] X. Rui, M. Li, Z. Li, W.Y. Ma and N. Yu. Bipartite Graph Reinforcement Model for Web Image Annotation. In Proceedings of the 15th Annual ACM International Conference on Multimedia, 2007.

[9] J. Cui, F. Wen, R. Xiao, Y. Tian, X. Tang. EasyAlbum: An Interactive Photo Annotation System Based on Face Clustering and Re-ranking. In Proceedings of SIGCHI, 2007.

[10] L. Zhang, L. Chen, M. Li, H. Zhang. Automated annotation of human faces in family albums. In Proceedings of ACM Multimedia, 2003.

[11] D. Zhou, J. Huang, and B. Schölkopf. Learning with local and global consistency. 18th Annual Conference on Neural Information Processing Systems, 2003.

[12] J. Jeon, V. Lavrenko and R. Manmatha. Automatic Image Annotation and Retrieval Using Cross-media Relevance Models. In Proc. of SIGIR, Toronto, July 2003.

[13] J Philbin, O Chum, M Isard, J Sivic, A Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, 2007.

[14] M. Wang, X. Hua, X. Yuan, Y. Song and L. Dai. Optimizing multi-graph learning: towards a unified video annotation scheme. Proceedings of the 15th international conference on Multimedia, 2007.

[15] J. Tang, X. Hua, T. Mei, G. Qi, S. Li and X. Wu. Temporally Consistent Gaussian Random Field for Video Semantic

Analysis. IEEE International Conference on Image Processing, 2007.

[16] V. Lavrenko, R. Manmatha and J. Jeon. A Model for Learning the Semantics of Pictures. In Proc. NIPS, 2003.

[17] X. He, W.Y. Ma and H.J. Zhang. Learning an image

manifold for retrieval. In Proc. of ACM international Conference On Multimedia, 2005.

[18] R.L. Cilibrasi and P.M.B. Vitányi. The Google Similarity Distance. IEEE Trans. on Knowledge and Data Engineering, 2007.

[19] J. Liu, M. Li, W.Y. Ma, Q. Liu and H. Lu. An Adaptive Graph Model for Automatic Image Annotation. ACM Workshop on Multimedia Information Retrieval (MIR), 2006.

[20] P. Duygulu and K. Barnard. Object recognition as machine

translation: learning a lexicon for a fixed image vocabulary. In Proc. of ECCV, 2002.

[21] J. Kandola, J. Shawe-Taylor, N. Cristianini. Learning Semantic Similarity. Annual Conference on Neural Information Processing System, 2003.

[22] X. Wang, W.Y. Ma, G. Xue, X. Li. Multi-Model Similarity Propagation and its Application for Web Image Retrieval. In Proceedings of ACM International Conference on Multimedia, 2004

[23] D. Lowe. Local feature view clustering for 3D object recognition. In Proc. CVPR, 2001.

[24] H. Lejsek, F. Ásmundsson, B. Jónsson, L. Amsaleg. Scalability of local image descriptors: a comparative study. ACM Multimedia 2006: 589-598.

[25] S. Boll, P. Sandhaus, A. Scherp, U. Westermann. Semantics, content, and structure of many for the creation of personal photo albums. ACM Multimedia 2007: 641-650

[26] X Lian, L Chen, JX Yu, G Wang, G Yu. Similarity Match Over High Speed Time-Series Streams. ICDE 2007: 1086-1095

[27] F. Golshani. EIC's Message: Multimedia is Correlated Media. IEEE MultiMedia 11(1): (2004)

[28] A. Zunjarwad, H. Sundaram, L. Xie. Contextual wisdom: social relations and correlations for multimedia event annotation. ACM Multimedia 2007: 615-624

[29] Y. Lin, H. Sundaram, Y. Chi, J. Tatemura and B. Tseng. Detecting splogs via temporal dynamics using self-similarity analysis. ACM Transactions on the Web 2(1), 2008.

[30] M. Cooper, J. Foote, A. Girgensohn, L. Wilcox. Temporal event clustering for digital photo collections. TOMCCAP 1(3): 269-288 (2005)

[31] R. Zhang, Z. Zhang, M. Li, W.Y. Ma, H. Zhang. A Probabilistic Semantic Model for Image Annotation and Multi-Modal Image Retrieval. ICCV 2005: 846-851

[32] W. Klas, R. King. Context-Aware Multimedia. Encyclopedia of Multimedia 2006

[33] L. Hardman, J. Ossenbruggen. Creating meaningful multimedia presentations. ISCAS 2006

[34] J. Liu, B. Wang, M. Li, Z. Li, W.Y. Ma, H. Lu and S. Ma. Dual Cross-Media Relevance Model for Image Annotation. In Proceedings of the 15th Annual ACM International Conference on Multimedia 2007.

[35] E. Chang, et al. CBSA: Content-Based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines. IEEE Trans. on Circuits and Systems for Video Technology, 2003, 13(1): 26-38.
