Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos
Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo
Nanyang Technological University & Kodak Research Lab



Page 1

Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos

Yiming Liu, Dong Xu, Ivor W. Tsang, Jiebo Luo

Nanyang Technological University & Kodak Research Lab

Page 2

Motivation

• Digital cameras and mobile phone cameras are rapidly becoming popular:
  – More and more personal photos;
  – Retrieving images from enormous collections of personal photos becomes an important topic.

How to retrieve?

Page 3

Previous Work

• Content-Based Image Retrieval (CBIR):
  – Users provide images as queries to retrieve personal photos.

• The paramount challenge is the semantic gap:
  – The gap between low-level visual features and high-level semantic concepts.

[Diagram: a query image's low-level feature vector is compared against the feature vectors in the database, but the desired result is an image with a high-level concept; the semantic gap lies between the two.]

Page 4

A More Natural Way For Consumer Applications

• Let the user retrieve the desired personal photos using textual queries.

• Image annotation is used to classify images w.r.t. high-level semantic concepts.

– Semantic concepts are analogous to the textual terms describing document contents.

• An intermediate stage for textual query based image retrieval.

[Diagram: the textual query "Sunset" is compared against annotation results (high-level concepts) produced by annotating the photo database; photos are then ranked to produce the result.]

Page 5

Our Goal

• Web images are accompanied by tags, categories, and titles.

[Diagram: web images paired with contextual information such as "building", "people, family", "people, wedding", "sunset", alongside consumer photos.]

• Leverage information from web images to retrieve consumer photos in a personal photo collection.

• No intermediate image annotation process.

• A real-time textual query based consumer photo retrieval system without any intermediate annotation stage.

Page 6

System Framework

[Diagram: Textual Query → Automatic Web Image Retrieval (over a large collection of web images with descriptive words, using WordNet) → Relevant/Irrelevant Images → Classifier → Consumer Photo Retrieval over raw consumer photos → Top-Ranked Consumer Photos, refined via Relevance Feedback.]

• When the user provides a textual query, it is used to find relevant/irrelevant images in the web image collection.
• Then a classifier is trained on these web images.
• Consumer photos are then ranked based on the classifier's decision values.
• The user can also give relevance feedback to refine the retrieval results.

Page 7

Automatic Web Image Retrieval

[Diagram: the query "boat" is looked up in semantic word trees built from WordNet (e.g. boat → ark, barge, dredger, houseboat, …); an inverted file maps words to relevant and irrelevant web images.]

• For the user's textual query, first search for it in the semantic word trees.
• Web images whose descriptions contain the query word are considered "relevant web images".
• Web images whose descriptions contain neither the query word nor its two-level descendants are considered "irrelevant web images".
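The relevant/irrelevant split above can be sketched as follows; the tiny word tree and inverted file here are hypothetical stand-ins for the real WordNet-derived trees and index:

```python
# Minimal sketch of the relevant/irrelevant split, using a hand-built word
# tree and a toy inverted file (word -> web-image ids); both are assumptions.

# word -> children; two levels reachable under "boat"
WORD_TREE = {
    "boat": ["ark", "barge", "dredger", "houseboat"],
    "barge": ["dredger"],
}

# inverted file: descriptive word -> set of web-image ids
INVERTED_FILE = {
    "boat": {1, 2},
    "barge": {3},
    "sunset": {4, 5},
    "people": {5, 6},
}

def two_level_descendants(word):
    """The query word plus its descendants up to two levels down."""
    words = {word}
    for child in WORD_TREE.get(word, []):
        words.add(child)
        words.update(WORD_TREE.get(child, []))
    return words

def split_web_images(query):
    expanded = two_level_descendants(query)
    relevant = INVERTED_FILE.get(query, set())
    # images tagged with the query word or any descendant are excluded
    # from the irrelevant pool
    excluded = set().union(*(INVERTED_FILE.get(w, set()) for w in expanded))
    all_images = set().union(*INVERTED_FILE.values())
    return relevant, all_images - excluded

rel, irrel = split_web_images("boat")
```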

Page 8

Decision Stump Ensemble

• Train a decision stump on each dimension.

• Combine them, weighted by their training error rates.
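A minimal sketch of this ensemble; weighting each stump by (1 - training error) is an assumption, since the slides only say the stumps are combined using their training error rates:

```python
# Decision-stump-ensemble sketch: one stump per feature dimension,
# combined with weights derived from their training errors (assumed form).

def train_stump(xs, ys):
    """Best threshold/polarity stump on one feature dimension.
    xs: list of floats, ys: list of +1/-1 labels."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (+1, -1):
            preds = [polarity if x >= t else -polarity for x in xs]
            err = sum(p != y for p, y in zip(preds, ys)) / len(ys)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best  # (training error, threshold, polarity)

def train_ensemble(X, ys):
    """One stump per dimension; X is a list of feature vectors."""
    stumps = []
    for d in range(len(X[0])):
        err, t, pol = train_stump([x[d] for x in X], ys)
        stumps.append((d, t, pol, 1.0 - err))  # weight = 1 - error (assumed)
    return stumps

def decision_value(stumps, x):
    # weighted vote of all stumps
    return sum(w * (pol if x[d] >= t else -pol)
               for d, t, pol, w in stumps)
```

A stump that perfectly separates its dimension gets weight 1.0; an uninformative one gets 0.5, so good dimensions dominate the vote.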

Page 9

Why Decision Stump Ensemble?

• Main reason: low time cost
  – Our goal: a (quasi) real-time retrieval system;
  – For base classifiers: SVMs are much slower;
  – For combination: boosting is also much slower.

• The advantages of the decision stump ensemble:
  – Low training cost;
  – Low testing cost;
  – Very easy to parallelize.

Page 10

Asymmetric Bagging

• Imbalance: count(irrelevant) >> count(relevant)
  – Side effects, e.g. overfitting.

• Solution: asymmetric bagging
  – Repeat 100 times with different randomly sampled subsets of irrelevant web images.

[Diagram: the relevant images are paired with 100 random samples of the irrelevant images to form 100 training sets.]
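The bagging step can be sketched as follows; the per-bag sample size (matching the relevant set) is an assumption, as the slides only specify 100 bags:

```python
# Asymmetric-bagging sketch: each bag keeps all relevant images plus a fresh
# random sample of irrelevant ones, so every bag is balanced.
import random

def asymmetric_bags(relevant, irrelevant, n_bags=100, seed=0):
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        sampled = rng.sample(irrelevant, k=min(len(relevant), len(irrelevant)))
        # positives labeled +1, sampled negatives labeled -1
        bags.append([(x, +1) for x in relevant] + [(x, -1) for x in sampled])
    return bags

bags = asymmetric_bags(list(range(5)), list(range(5, 100)))
```

Each bag would then train its own decision stump ensemble, and the 100 ensembles are averaged at test time.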

Page 11

Relevance Feedback

• The user labels n_l relevant or irrelevant consumer photos.
  – Use this information to further refine the retrieval results.

• Challenge 1: usually n_l is small.

• Challenge 2: cross-domain learning
  – The source classifier is trained on the web image domain;
  – The user labels some personal photos.

Page 12

Method 1: Cross-Domain Combination of Classifiers

• Re-train classifiers with data from both domains?
  – Neither effective nor efficient.

• A simple but effective method:
  – Train an SVM on the consumer photo domain with the user-labeled photos;
  – Convert the responses of the source classifier and the SVM classifier to probabilities, and add them up;
  – Rank consumer photos by this sum.

• Referred to as DS_S+SVM_T.
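The fusion step can be sketched as below; mapping decision values to probabilities with a logistic sigmoid is an assumption, since the slides only say the responses are converted to probabilities and summed:

```python
# DS_S+SVM_T score-fusion sketch: sigmoid-map both classifiers' decision
# values to (0, 1) and rank photos by the sum.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fused_ranking(photos, source_score, svm_score):
    """photos: ids; source_score/svm_score: id -> decision value."""
    fused = {p: sigmoid(source_score[p]) + sigmoid(svm_score[p])
             for p in photos}
    return sorted(photos, key=lambda p: fused[p], reverse=True)

ranking = fused_ranking(
    ["a", "b", "c"],
    {"a": 2.0, "b": -1.0, "c": 0.5},   # source (web-domain) scores
    {"a": -0.5, "b": 1.5, "c": 0.5},   # target-domain SVM scores
)
```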

Page 13

Method 2: Cross-Domain Regularized Regression (CDRR)

• Construct a linear regression function fT(x):
  – For labeled photos: fT(xi) ≈ yi;
  – For unlabeled photos: fT(xi) ≈ fs(xi), the output of the source classifier.

Page 14

• Design a target linear classifier fT(x) = wᵀx:
  – For user-labeled images x1, …, xl, fT(x) should be the user's label y(x);
  – For other images, fT(x) should match the source classifier output fs(x);
  – A regularizer controls the complexity of the target classifier fT(x).

• This problem can be solved with a least-squares solver.
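A closed-form least-squares sketch of CDRR; the trade-off weights (lam_u for the unlabeled term, lam_r for the regularizer) are assumptions, as the slides give the three terms but not their parameters:

```python
# CDRR sketch: minimize  sum_i (w.x_i - y_i)^2        (labeled photos)
#                      + lam_u * sum_j (w.x_j - fs_j)^2  (unlabeled photos)
#                      + lam_r * ||w||^2                 (regularizer)
# which has the closed-form normal-equation solution below.
import numpy as np

def cdrr_fit(X_l, y_l, X_u, f_s, lam_u=1.0, lam_r=0.1):
    d = X_l.shape[1]
    A = X_l.T @ X_l + lam_u * (X_u.T @ X_u) + lam_r * np.eye(d)
    b = X_l.T @ y_l + lam_u * (X_u.T @ f_s)
    return np.linalg.solve(A, b)

# Toy usage: labels constrain one direction, the source classifier the other.
X_l = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_l = np.array([1.0, -1.0])
X_u = np.array([[0.0, 1.0], [0.0, -1.0]])
f_s = np.array([0.5, -0.5])
w = cdrr_fit(X_l, y_l, X_u, f_s)
```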

Page 15

Hybrid Method

• A combination of the two methods.
• For the labeled consumer photos:
  – Measure the average distance d_avg to their 30 nearest unlabeled neighbors in feature space;
  – If d_avg < ε: use DS_S+SVM_T;
  – Otherwise: use CDRR.

• Reason:
  – Consumer photos that are visually similar to the user-labeled images should be influenced more by those labeled images.
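The selection rule can be sketched as below; the Euclidean metric and the ε value are assumptions:

```python
# Hybrid-rule sketch: pick the fusion method from the average distance of
# labeled photos to their k nearest unlabeled neighbors (k=30 in the paper).
import math

def avg_knn_distance(labeled, unlabeled, k=30):
    dists = []
    for x in labeled:
        nn = sorted(math.dist(x, u) for u in unlabeled)[:k]
        dists.extend(nn)
    return sum(dists) / len(dists)

def choose_method(labeled, unlabeled, eps=0.5, k=30):
    d_avg = avg_knn_distance(labeled, unlabeled, k)
    return "DS_S+SVM_T" if d_avg < eps else "CDRR"
```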

Page 16

Experimental Results

Page 17

Dataset and Experimental Setup

• Web image database:
  – 1.3 million photos from photoSIG;
  – Relatively professional photos.

• Text descriptions for web images:
  – Title, portfolio, and categories accompanying the web images;
  – Remove common high-frequency words;
  – Remove rarely used words;
  – Finally, 21,377 words in our vocabulary.

Page 18

Dataset and Experimental Setup

• Testing dataset #1: Kodak dataset
  – Collected by Eastman Kodak Company:
    • From about 100 real users;
    • Over a period of one year.
  – 1358 images:
    • The first keyframe from each video.
  – 21 concepts:
    • We merge "group_of_two" and "group_of_three_or_more" into one concept.

Page 19

Dataset and Experimental Setup

• Testing dataset #2: Corel dataset
  – 4999 images:
    • 192x128 or 128x192.
  – 43 concepts:
    • We remove all concepts with fewer than 100 images.

Page 20

Visual Features

• Grid-based color moments (225D):
  – Three moments of three color channels from each block of a 5x5 grid.

• Edge direction histogram (73D):
  – 72 edge direction bins plus one non-edge bin.

• Wavelet texture (128D).

• Concatenate all three kinds of features:
  – Normalize each dimension to avg = 0, stddev = 1;
  – Use the first 103 principal components.

Page 21

Retrieval without Relevance Feedback

• For all concepts:
  – Average number of relevant images: 3703.5.

Page 22

Retrieval without Relevance Feedback

• kNN: rank consumer photos by their average distance to the 300 nearest neighbors among the relevant web images.

• DS_S: decision stump ensemble.
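The kNN baseline can be sketched as follows (smaller average distance ranks higher); the Euclidean metric is an assumption:

```python
# kNN-baseline sketch: score each consumer photo by its average distance to
# its k nearest relevant web images (k=300 in the paper), then sort ascending.
import math

def knn_score(photo, relevant_web, k=300):
    dists = sorted(math.dist(photo, w) for w in relevant_web)
    return sum(dists[:k]) / min(k, len(dists))

def knn_rank(photos, relevant_web, k=300):
    return sorted(photos, key=lambda p: knn_score(p, relevant_web, k))
```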

Page 23

Retrieval without Relevance Feedback

• Time cost:
  – We use OpenMP to parallelize our method;
  – With 8 threads, both methods achieve interactive speed;
  – But kNN is expected to be much more expensive on large-scale datasets.

Page 24

Retrieval with Relevance Feedback

• In each round, the user labels at most 1 positive and 1 negative image among the top 40.

• Methods for comparison:
  – kNN_RF: add the user-labeled photos to the relevant image set and re-apply kNN;
  – SVM_T: train an SVM on the user-labeled images in the target domain;
  – A-SVM: Adaptive SVM;
  – MR: Manifold Ranking based relevance feedback.

Page 25

Retrieval with Relevance Feedback

• Setting of y(x) for CDRR:
  – Positive: +1.0;
  – Negative: -0.1.

• Reason:
  – The top-ranked negative images are not extremely negative;
  – Positive feedback says "what it is"; negative feedback says "what it is not".

[Diagram: examples of positive and negative feedback images.]

Page 26

Retrieval with Relevance Feedback

• On the Corel dataset:

Page 27

Retrieval with Relevance Feedback

• On the Kodak dataset:

Page 28

Retrieval with Relevance Feedback

• Time cost:
  – All methods except A-SVM achieve real-time speed.

Page 29

System Demonstration

Page 30

Query: Sunset

Page 31

Query: Plane

Page 32

The User Provides Relevance Feedback…

Page 33

After 2 positive and 2 negative feedback labels…

Page 34

Summary

• Our goal: (quasi) real-time textual query based consumer photo retrieval.

• Our method:
  – Use web images and their surrounding text descriptions as an auxiliary database;
  – Asymmetric bagging with decision stumps;
  – Several simple but effective cross-domain learning methods to support relevance feedback.

Page 35

Future Work

• How to efficiently use more powerful source classifiers?

• How to further improve the speed:
  – Keep training time within 1 second;
  – Control testing time when the consumer photo set is very large.

Page 36

Thank you!

• Any questions?