A System for Large-scale, Content-based Web Image Retrieval - and the Semantics within

Thursday, May 27, 2004

Till Quack


Task

Create a content-based image retrieval system for the WWW
  • Large-scale: one order of magnitude larger than existing systems, which means O(10^6) items
  • Relevance feedback
Explore and exploit the semantics within
Take large-scale, content-based image retrieval one step closer to commercial applications


Outline

Content-based Image Retrieval on the WWW
PART I: A System for Image Retrieval on the WWW
  • Features
  • Retrieval
  • Relevance Feedback
  • Software Design
PART II: The Semantics within
  • Identifying a Method to find Semantics
  • Data Mining for Semantic Clues
  • Frequent Itemset Mining and Association Rules
  • The Visual Link
Discussion & Demonstration
Conclusions & Outlook


Content-based Image Retrieval on the WWW

Characteristics of the data repository
  • Size: 4.2 billion documents in Google’s index
  • Diversity: documents in any context and language
  • Control: anybody can publish anything
  • Dynamics: ever changing
System requirements
  • Fast
  • Scalable
  • Make use of all the information available
Motivation for a new system
  • Existing systems are either pure text (Google) or pure content-based
  • Large-scale


PART I: A System for Large-scale, Content-based Image Retrieval on the WWW

Ullrich Moenich, Till Quack, Lars Thiele


System Overview

[Figure: system architecture]
Offline: Image Spider (World Wide Web, DMOZ Data) -> Images (Binaries) and Image Description -> Keyword Extraction and Feature Extraction -> Keywords and Visual Features -> Keyword Indexing (Inverted Index: keyid | imageid) and Clustering (Cluster 1, Cluster 2, ..., Cluster n); storage in MySQL
Retrieval: Keyword Request -> Matching Images -> User picks relevant images -> Nearest Neighbor Search -> Matching Images


Visual Features describe the Images

Global features from the MPEG-7 standard
Currently no segmentation
  • Reasons: scalability and the diversity of the data
Texture features
  • Edge Histogram Descriptor (EHD)
    – Histogram of quantized edge directions, 80 dimensions
  • Homogeneous Texture Descriptor (HTD)
    – Output of a Gabor filter bank, 62 dimensions
Color features
  • Scalable Color Descriptor (SCD)
    – Color histogram with 256, 128, 64 or 32 dimensions
  • Dominant Color Descriptor (DCD)
    – Up to 8 dominant colors (3D color space) and their percentages
    – 32 “dimensions”
    – “Bins” defined for each image


Collateral Text as an additional Feature

ALT tag and collateral text around images
  • VERY uncontrolled annotation
Stemming: Porter Stemmer
  • Example: training -> train
  • More matching terms for boolean queries
  • But also some new ambiguities
    – train: to train [verb] / the train [noun]


Retrieval in 2 Steps

1. Text Retrieval

2. Visual Nearest Neighbor Search


Retrieval: Text

Options
  • Boolean query on inverted index
  • Vector Space Model
  • LSI etc.
Choice
  • Ranked boolean queries on inverted index
  • Ranking: tf*idf
Reasons
  • Speed
  • Sparsity of data
    – 600 000 keywords in total
    – 1 document: 10-50 words

Keyword   ImageId   tf
shoe      1233      1
sport     1233      1
red       1233      1
banana    1234      1
fruit     1234      2
Order     1234      1

Keyid     ImageId   tf
124       1233      1
341       1233      1
345       1233      1
445       1234      1
75        1234      2
875       1234      1
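
A ranked boolean query over such an inverted index can be sketched as follows. This is a minimal illustration, not the system's Perl/MySQL code: the in-memory dictionary, the function names, and the idf form log(N / df) are assumptions, and the postings are the toy rows from the table above.

```python
import math
from collections import defaultdict

# Toy postings in the (keyword, imageid, tf) layout of the table above.
POSTINGS = [
    ("shoe", 1233, 1), ("sport", 1233, 1), ("red", 1233, 1),
    ("banana", 1234, 1), ("fruit", 1234, 2), ("Order", 1234, 1),
]

def build_index(postings):
    """Map each keyword to a posting list {imageid: tf}."""
    index = defaultdict(dict)
    for keyword, image_id, tf in postings:
        index[keyword][image_id] = tf
    return index

def ranked_boolean_and(index, terms, n_docs):
    """Boolean AND over all terms, ranked by the summed tf*idf score."""
    lists = [index.get(t, {}) for t in terms]
    if not lists:
        return []
    hits = set(lists[0])
    for plist in lists[1:]:
        hits &= set(plist)
    scores = {}
    for doc in hits:
        # idf = log(N / df) is one common choice (assumption).
        scores[doc] = sum(
            plist[doc] * math.log(n_docs / len(plist)) for plist in lists
        )
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only the posting lists of the query terms are touched, which is why this approach stays fast when each document carries just 10-50 words.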


Retrieval – Visual Features (MPEG-7)

K-Nearest Neighbor search (K-NN)
  • Find the K closest candidates c_i to the query image q in a vector space
  • Distance: Minkowski metrics for the distance d(c_i, q), namely the L1 and L2 norms
Most MPEG-7 descriptors are high-dimensional vectors
  • The “dimensionality curse” applies
  • High-dimensional spaces behave “weirdly”
  • In particular, the distances are not too meaningful
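
A brute-force K-NN search with a Minkowski metric can be sketched as below. The function names and the dictionary layout for the candidates are hypothetical; p = 1 gives the L1 norm and p = 2 the L2 norm.

```python
import heapq

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn(query, candidates, k, p=2):
    """candidates: dict id -> vector. Returns the k closest as (distance, id)."""
    scored = ((minkowski(vec, query, p), cid) for cid, vec in candidates.items())
    return heapq.nsmallest(k, scored)
```

This linear scan compares the query with every candidate, which is exactly the cost a large database cannot afford.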


Retrieval – Challenges for Visual Features

We have several (visual) feature types
  • How can we combine them?
Our database is very large
  • How can we search it fast enough?
  • i.e. how can we avoid comparing the query vector with each database entry?


A Combined Distance for the MPEG-7 Features

We use a combined distance of all the visual feature types
  • The individual distances occupy different ranges in different distributions
  • The distributions were transformed to a normal distribution in the range [0,1]
  • The distances are then combined linearly
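
One way to picture the combination is sketched below. The exact transformation is not given here, so the normalization used (shift by the mean, scale by three standard deviations, clip to [0,1]) is an assumption, as are the function names and the per-feature statistics.

```python
def normalize(d, mean, std):
    """Map a raw distance to [0, 1]: (d - mean) / (3 * std), shifted and clipped.

    Assumed normalization; the mean and std would be estimated per feature type.
    """
    z = (d - mean) / (3.0 * std)
    return min(1.0, max(0.0, (z + 1.0) / 2.0))

def combined_distance(raw, stats, weights):
    """Linear combination of the normalized per-feature distances.

    raw:     feature type -> raw distance
    stats:   feature type -> (mean, std) of that feature's distances
    weights: feature type -> weight in the linear combination
    """
    total = sum(weights[f] for f in raw)
    return sum(weights[f] * normalize(raw[f], *stats[f]) for f in raw) / total
```

After normalization a distance of 0.5 means "average" for every feature type, so the linear weights become comparable across EHD, HTD, SCD and DCD.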


Clustering speeds up the search

Problem
  • Millions of items in the DB
  • Linear search over the whole dataset is too slow
  • We are looking only for the K nearest neighbors anyway
(One) Solution
  • Partition the data into clusters, each identified by a representative, the centroid
  • Only search the cluster whose centroid is closest to the query q
K-Means clustering algorithm
  • Not the best, in particular in high-dimensional spaces
  • But fast!
Problem with clustering
  • A query at the border of a cell does not find all the nearest neighbors
Simple solution: overlapping clusters
  • Problem: scalability
    – Original data: 7 GB
    – Overlapping data: 50 GB

Imageid   Primary Descriptor   Secondary Descriptor 1
122       ehd                  htd                      scd   dcd
45233     ehd                  htd                      scd   dcd
6688      ehd                  htd                      scd   dcd
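
Restricting the search to a single cluster can be sketched as follows. The centroids are assumed to come from a prior K-Means run; the function names and the in-memory layout of the cluster members are hypothetical.

```python
def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_cluster(query, centroids):
    """centroids: cluster id -> centroid vector. Returns the closest cluster id."""
    return min(centroids, key=lambda cid: l2(query, centroids[cid]))

def search_in_cluster(query, centroids, members, k):
    """members: cluster id -> list of (imageid, vector).

    Only the members of the closest cluster are compared with the query.
    """
    cid = nearest_cluster(query, centroids)
    ranked = sorted(members[cid], key=lambda item: l2(query, item[1]))
    return cid, [image_id for image_id, _ in ranked[:k]]
```

A query that falls near the border between two cells still searches only one of them, which is the source of the missed neighbors described above.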


Relevance Feedback Improves the Results

Relevance feedback: user input to improve the search results, iteration by iteration
  • i.e. the user selects “good matches”
We obtain the following information:
  1. A new query vector which is a combination of the relevant images = Query Vector Movement
  2. The ratios for the combination of the feature types


Relevance Feedback: Query Vector Movement

Construct the query vector q_n from the images selected in iteration n
  • Vector component k
  • Feature type f (EHD, SCD, HTD)
  • i = 1...M relevant images
The final, new query vector is

  q = 0.75 * q_n + 0.25 * q_{n-1}

i.e. move from the old query vector towards the new vector
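
The update can be sketched in a few lines. The 0.75 / 0.25 mix is taken from the formula above; building q_n as the component-wise mean of the M selected images is an assumption made for illustration, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of M equal-length vectors (assumed form of q_n)."""
    m = len(vectors)
    return [sum(column) / m for column in zip(*vectors)]

def move_query(prev_query, relevant_vectors, alpha=0.75):
    """q = alpha * q_n + (1 - alpha) * q_{n-1}, per feature type."""
    q_n = mean_vector(relevant_vectors)
    return [alpha * new + (1.0 - alpha) * old for new, old in zip(q_n, prev_query)]
```

The function would be called once per feature type (EHD, SCD, HTD), since each type has its own vector.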


Relevance Feedback: Weight Adaptation

Which feature is most important for the given query?
  • The one for which all the relevant images are closest
Determine the ratios for the combination based on the average distance of the relevant images, e.g. for the EHD, and set the weights from it
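
A possible weighting scheme is sketched below. The exact formula is not given here, so making each weight inversely proportional to the average distance of the relevant images under that feature type is an assumption; the function name and the small `eps` guard are hypothetical.

```python
def adapt_weights(avg_distance, eps=1e-9):
    """avg_distance: feature type -> average distance among the relevant images.

    Assumed rule: weight ~ 1 / average distance, normalized to sum to 1,
    so the feature under which the relevant images are closest weighs most.
    """
    inverse = {f: 1.0 / (d + eps) for f, d in avg_distance.items()}
    total = sum(inverse.values())
    return {f: value / total for f, value in inverse.items()}
```

With average distances of 0.1 (EHD), 0.2 (SCD) and 0.4 (HTD), the EHD would receive the largest weight in the next combined distance.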


Implementation – Software and Hardware

Languages: C++ and Perl
  • Inline::CPP to connect the layers
WWW: Apache and CGI
Relational DB: MySQL
Operating system: OS X
Hardware
  • Dual 2 GHz Apple G5, 2 GB RAM
  • Teran Terabyte Disk Array


Part II: The Semantics Within


Semantics: Combining Text and Visual Features

Our dataset is multi-modal
  • Keywords and several visual features
  • Not only valid for WWW data
    – Video: image + speech
    – Bio-imagery: image + microscope setting, cell coloring fluid
Goal: try to jointly use the different modes
  • Do semantic relations between the modes exist?
  • Learn something about these semantic relations
  • Improve the retrieval precision based on them
Challenges in our project
  • Large-scale
  • Noisy and uncontrolled data
  • Only global visual features


Identifying a Method to find the Semantics

Related work
  • Latent Semantic Indexing (LSI) [Westerveld 2000]
    – Problem: O(N^2 m^3), N = documents + terms, m = concept space
  • Statistical models [Barnard, Forsyth 2001-2004]
    – Problem, complexity: “several hours for several thousand images”
    – Problem: it is a (rather strict, hierarchical) model
  • Others
    – Neural networks (SOM etc.)
    – Hidden Markov Models
Often: classification
  • We don’t know our classes, or: there are just too many
  • We can’t train them either (data too diverse and noisy)
Most of the methods above were only tested on relatively small, supervised datasets
There is one more option …


Method: Data Mining for Semantic Clues

Mine the data for patterns
  • Find them only where they exist
  • Deduce rules from them
  • Scalable methods available
Frequent Itemset Mining and Association Rules
  • Classic application: market baskets, census data …
  • Some works on multimedia data
    – [Zaïane 98]: Datacubes with appended keywords
    – [Tešić et al. 03]: Perceptual associations (texture) within images


Frequent Itemsets and Association Rules

Itemset I
Transaction T
Database D
Support of itemset A
  • A is called frequent if its support exceeds the minimal support minsupp
Rule
Support of a rule
  • Statistical significance
Confidence of a rule
  • Strength of implication
  • Maximum likelihood estimate that B is true given that A is true
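
In the standard notation of the association rule literature, these quantities can be written as below. The symbols are an assumption about the notation intended here, with p items and n transactions.

```latex
\begin{align*}
I &= \{i_1, \dots, i_p\} && \text{(set of all items)} \\
T &\subseteq I && \text{(transaction)} \\
D &= \{T_1, \dots, T_n\} && \text{(database of transactions)} \\
\operatorname{supp}(A) &= \frac{|\{T \in D : A \subseteq T\}|}{|D|}, \quad A \subseteq I \\
A \text{ is frequent} &\iff \operatorname{supp}(A) > \mathit{minsupp} \\
\text{Rule: } & A \rightarrow B, \quad A, B \subseteq I,\ A \cap B = \emptyset \\
\operatorname{supp}(A \rightarrow B) &= \operatorname{supp}(A \cup B) \\
\operatorname{conf}(A \rightarrow B) &= \frac{\operatorname{supp}(A \cup B)}{\operatorname{supp}(A)}
\end{align*}
```

The confidence is the fraction of transactions containing A that also contain B, which is why it reads as an estimate of P(B | A).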


Example & Advantages

Example: market baskets
  • Rule {Diaper, Milk} -> {Beer}
Advantages
  • Human readable
  • Can be edited
  • Fast algorithms available
Note: associations are not correlations
  • The same concept, just simpler
  • Associations and correlations: [Brin, Motwani, Silverstein 98]

TID   Items
1     Bread, Milk
2     Beer, Diaper, Bread, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Bread, Diaper, Milk
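
The support and confidence of the example rule can be checked directly on the five baskets above. This is a minimal sketch using the usual definitions (support as the fraction of transactions containing the itemset, confidence as a ratio of supports); the function names are hypothetical.

```python
BASKETS = {
    1: {"Bread", "Milk"},
    2: {"Beer", "Diaper", "Bread", "Eggs"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Bread", "Diaper", "Milk"},
}

def support(itemset, baskets):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for items in baskets.values() if itemset <= items) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """supp(A and B together) / supp(A) for the rule A -> B."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)
```

For {Diaper, Milk} -> {Beer}, three baskets (3, 4, 5) contain Diaper and Milk, and two of them (3, 4) also contain Beer, so the rule has support 2/5 and confidence 2/3.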


Using FIMI to find the itemsets

Frequent Itemset Mining (FIMI)
  • Find frequent itemsets with support > minsupp
  • Minimal support minsupp given by “an expert”
First algorithm: Apriori [Agrawal et al. 93]
  • Basic idea: if an itemset is frequent, all its subsets must be frequent (monotonicity)
  • k passes over the dataset for itemsets of length k
  • ~O(knp) for n transactions, p items, itemsets of length k
Today’s algorithms
  • Rely on the same basic principle
  • But much faster (main reason: data structures)
    – Usually only 2 database passes
    – ~linear runtime
State-of-the-art algorithm overview: FIMI’03
  • We used: fpmax* [Grahne, Zhu: Nov 03]
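
The level-wise idea behind Apriori can be sketched as below. This is a plain illustration of the monotonicity principle, not fpmax* and not tuned for large data; the function name and the in-memory list of transactions are assumptions.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return {itemset: support} for all itemsets with support > minsupp.

    transactions: list of sets of items. One pass over the data per level k.
    """
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        counted = {c: supp(c) for c in level}
        current = {c: s for c, s in counted.items() if s > minsupp}
        frequent.update(current)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (monotonicity): every k-subset must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k))}
        k += 1
    return frequent
```

The prune step is where monotonicity pays off: a candidate with even one infrequent subset is discarded without counting it against the database.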


Diapers and Beer !!?

Application to the domain of multimedia data:
  1. Formulate images as transactions
  2. Low-level clusters serve as a dimensionality reduction for the visual features
  3. We find associations of visual features (clusters) and keywords
  4. From these associations we deduce semantic rules
Advantages
  • Comparatively low computational complexity
  • Other data sources can be integrated in the same manner (e.g. long-term relevance feedback)
Challenges
  • Noisy, uncontrolled data
  • Associations within keywords are much stronger than associations between keywords and visual features
  • Uneven distribution of cluster sizes (K-Means problem)
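
Formulating an image as a transaction can be pictured as follows. The item encoding (a `kw:` prefix for keywords, feature type plus cluster id for visual items) and the function name are hypothetical choices for illustration.

```python
def image_to_transaction(keywords, cluster_ids):
    """Turn one image into a set of items for frequent itemset mining.

    keywords:    iterable of (stemmed) keywords of the image
    cluster_ids: feature type -> id of the low-level cluster the image falls in
    """
    items = {"kw:" + kw for kw in keywords}
    items |= {"{}{}".format(ftype, cid) for ftype, cid in cluster_ids.items()}
    return frozenset(items)
```

For example, an image with the keywords shirt and red that lies in EHD cluster 249 and SCD cluster 12 becomes the transaction {kw:shirt, kw:red, ehd249, scd12}, so each high-dimensional vector is reduced to a single item per feature type.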


Characteristics of the Itemsets and Rules

There are associations
  • Within text: {shoe} -> {walk}
  • Within visual clusters: {EHD 14} -> {SCD 12}
  • Between text and visual clusters: {shoe} -> {EHD 14}
Measure for interestingness or choice of rules from the frequent itemsets (FI)
  • Confidence?
  • Statistical criteria?
  • Background knowledge? (Example: pregnant -> Woman: 100% confidence)
Our “background knowledge”: rules that connect keywords and low-level features are more interesting
  • Since this is known, the mining can be adapted and made even faster


Exploiting the Itemsets and Rules


Selecting Interesting Low-Level Clusters based on Rules

Clusters were introduced to partition the visual feature vector data and search only on certain clusters
  • Problem: we miss certain nearest neighbors if the images for a concept are spread over several clusters
  • Unsatisfactory solution: overlapping clusters
But association rules might find and solve this situation
  • Clusters are re-united if the number of images for the concept in both clusters is > minsupp
  • Example: {shirt} -> {ehd249, ehd310} reunites these clusters for the initial keyword query “shirt”!
  • This is scalable, unlike overlapping clusters
Another benefit is that more images labeled with the original keyword are “injected” into the results of the K-NN search
Currently: one keyword as high-level semantic concept
Future: find high-level semantic concepts by mining associations within text first


The Visual Link

Another contribution, NOT related to Frequent Itemset Mining and Association Rules…
Since the search concept suggests a visual nearest neighbor search with relevance feedback after the initial keyword search:
  • It would be nice to have a diverse selection of images for a given keyword on the first page of results
  • Images sorted not only by keyword ranking, but also based on visual feature information
Basic idea: for a given keyword query, build groups of images that are visually close
  • Larger groups are more important
  • Show only one representative per group


The Visual Link: A Graph-Based Approach

Let I(Q) be the set of images matching a keyword query Q
Define a graph G(V,E) whose vertices V are the images in I(Q) and whose edges E connect two images if the distance between them is lower than a given threshold, i.e. such images are visually linked
Do a connected component analysis to find the connected components C
For each component C find the “best” representative r_C
Re-rank the results based on the representatives r_C
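
The graph-based grouping can be sketched with a union-find structure, as below. The L1 distance, the choice of the medoid as the “best” representative, and the function names are assumptions; groups are returned largest first, following the idea that larger groups are more important.

```python
def l1(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def visual_link_groups(vectors, threshold, dist=l1):
    """vectors: imageid -> feature vector for the images in I(Q).

    Links two images if their distance is below the threshold and returns
    the connected components, largest first.
    """
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if dist(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))

def representative(group, vectors, dist=l1):
    """Assumed choice of r_C: the member with the smallest summed distance."""
    return min(group, key=lambda i: sum(dist(vectors[i], vectors[j]) for j in group))
```

The double loop computes all pairwise distances within I(Q), which is quadratic in the number of matching images.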


The Visual Link: An Example


The Visual Link: An Approximation

Problem: distance calculations for the graph take too long
  • Clusters cannot be used
  • Loading individual vectors takes a lot of time
Solution: approximate distance
  • Idea: if images are in the same cluster and in the same distance range to the centroid, the probability that they are “close” is high
New definition for visually linked
  • If in the same cluster and in the same range of relative distance to its centroid
Can be encoded in a relational DB!
  • And comes at nearly no extra cost in creation

Imageid   Clusterid   2ndClusterid   Reldist
1         221         122            0.6
2         342         345            0.8
3         223         42             0.2
4         12          126            0.4
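
The approximate link reduces to grouping rows by cluster id and a bin of the relative distance, which can be sketched as below. The number of bins (and therefore the width of a “distance range”) is an assumption, as are the function name and the row layout without the second cluster id.

```python
from collections import defaultdict

def approx_visual_links(rows, n_bins=5):
    """rows: iterable of (imageid, clusterid, reldist) with reldist in [0, 1].

    Images sharing a cluster and a relative-distance bin count as linked.
    """
    groups = defaultdict(list)
    for image_id, cluster_id, reldist in rows:
        bin_id = min(int(reldist * n_bins), n_bins - 1)
        groups[(cluster_id, bin_id)].append(image_id)
    return dict(groups)
```

No feature vectors are loaded at all; in a relational database the same grouping is a single GROUP BY over two columns.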


Discussion & Demo


Discussion: Precision

Measuring the quality of such a large-scale system is difficult
  • Precision/recall measure not possible: the ground truth is not known
    – C: correct results
    – D: desired results
    – A: actual results
We measure the precision based on user questioning
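
With the sets C, D and A named above, the usual definitions of the two measures are as follows (stated here as the standard textbook form, which is an assumption about the intended formula):

```latex
\[
\text{precision} = \frac{|C|}{|A|}, \qquad
\text{recall} = \frac{|C|}{|D|}, \qquad
C = A \cap D
\]
```

Recall needs the full set D of desired results, which is unknown for a web-scale collection, whereas precision can be estimated by asking users to judge the returned set A.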


Before we continue … some numbers

Number of images: 3 006 660
Size of image data: 111 GB
Feature extraction: 15 days (dual 2 GHz CPU, 2 GB RAM)
Number of distinct keywords: 680 256
Size of inverted keyword index table: 50 260 345 lines
MySQL database size: 23 GB


And now … the moment you’ve all been waiting for …

The Demo of Cortina


Conclusions

A system with over 3 million items was implemented
  • Probably the largest CBIR system to date?
A retrieval concept was introduced
  • A keyword query followed by relevance feedback and visual nearest neighbor search
  • Superior to existing retrieval concepts (query by keyword or query by example)
Data mining to explore and exploit semantics in large-scale systems was introduced


Questions


Outlook

Many extensions and improvements possible
  • Segmentation
    – Or maybe rather some simple tiling
  • Indexing
    – K-Means should be replaced
    – Suggestion: VA-File based approach [Manjunath, Tesic 03]
  • Association rule mining
    – Multilevel approach
    – First keywords for high-level semantic concepts
    – Then visual features


Thanks Ullrich Moenich and Lars Thiele


Which Rules are of Interest?

There are associations
  • Within text: {shoe} -> {walk}
  • Within visual clusters: {EHD 14} -> {SCD 12}
  • Between text and visual clusters: {shoe} -> {EHD 14, SCD 12}
There are long and short rules
  • Short rules have higher support by the nature of the problem
  • Long rules contain more (precise) information about the semantics
Measure for interestingness or choice of rules from the frequent itemsets (FI)
  • Confidence?
  • Statistical criteria?
  • Background knowledge? (Example: pregnant -> Woman)


Characteristics and Challenges

Chosen criteria
  • Mainly interested in rules {keywords} -> {visual feature clusters} (our “background knowledge”)
  • Support, confidence
  • Mine long and short rules
Restriction of the problem: mine for frequent itemsets per keyword
  • i.e. all images for a given keyword form the transactions
  • This means
    – We avoid being distracted by associations within keywords
    – The method is made even more scalable
The keyword as a placeholder for a semantic concept
  • A keyword does not always stand for a single semantic concept
  • Proposal for future versions: multi-level approach
    – First {keywords} -> {keywords} rules to identify “real” semantic concepts
    – Then itemset mining per identified concept


Characteristics of the Itemsets and Rules - Overall


Why keyword filtering of the results does not work


Proposal: Semantic Clusters

Ultimate goal: search some kind of “semantic clusters” instead of visual feature clusters
Proposal based on the approach from Ester et al. 2002, 2003
  • Clustering based on frequent itemsets, originally for text
  • Clustering criterion: minimize overlap