Content-based image and video analysis

Image and Video Semantics

14.06.2010
Levels of image/video retrieval

Level 1: Based on color, texture, and shape features
Images are compared based on low-level features, no semantics involved
A lot of research done; a feasible task

Level 2: Bring semantic meanings into search
e.g. identifying human beings, horses, trees, beaches
Requires retrieval techniques of level 1

Level 3: Retrieval with abstract and subjective attributes
Find pictures of a particular birthday celebration
Find a picture of a happy, beautiful woman
Requires retrieval techniques of level 2 and very complex logic
Semantic gap

“Lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation” (Smeulders et al., 2000)

“Semantic gap” = gap between level 1 and level 2

This lecture: overview of techniques to “bridge” this gap
What is semantics?

Semantics = meaning
BUT: What is the semantics of this picture?
What is semantics?

What is the semantics of this picture?
“Upper-right uniform light-blue region and upper-left black region and middle dark-blue region and uniform lower light-brown region”?
Sea, sky, sand, mountains, waves, boats?
Beach with mountains in the background?
Sunset at the beach?
Rio de Janeiro, sugar loaf?
…
Challenges

Semantics has many degrees of granularity
Semantics is in the eye of the beholder
There are different ways to express the same thing

When attempting to bridge the semantic gap, CBIR systems should pay attention to all those points
Ways to bridge the semantic gap

1. Machine learning (supervised, unsupervised)
2. Ontology based
3. Relevance feedback
4. Web image retrieval
…

Usually a combination of those approaches is used
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 8
Machine learning

Basic steps:
Convert the image to low-level features (color/edge histograms, wavelets, DCT, etc.)
Use a machine learning algorithm to map the features to image/video semantics (SVM, ANN, CART, etc.)
A minimal sketch of these two steps is shown below.

Examples:
Genre classification
High-level feature detection
Person identification
Event recognition
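As a rough illustration, here is a minimal sketch of the two-step pipeline above, assuming scikit-learn and a simple RGB color histogram as the low-level feature; the file names and concept labels are hypothetical placeholders.

```python
import numpy as np
from PIL import Image
from sklearn.svm import SVC

def color_histogram(path, bins=8):
    """Step 1: low-level feature, a joint RGB color histogram."""
    img = np.asarray(Image.open(path).convert("RGB"))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()  # normalize so image size does not matter

# Hypothetical training data: image paths with known concept labels.
train_paths = ["beach1.jpg", "beach2.jpg", "city1.jpg", "city2.jpg"]
train_labels = ["beach", "beach", "city", "city"]

# Step 2: map the features to semantics with a supervised learner (here an SVM).
X = np.array([color_histogram(p) for p in train_paths])
clf = SVC().fit(X, train_labels)

print(clf.predict([color_histogram("unknown.jpg")]))
```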
Machine learning

Supervised learning
The semantic concept of the training data samples is known in advance

Unsupervised learning
No prior knowledge available
E.g. clustering of the data based on some similarity measure
Semantics can be assigned manually to each cluster
Machine learning

Problems:

Lots of (annotated) training data needed
Remedies: bootstrapping; human computation (ESP game, Google Image Labeler; see the “Tools” lecture)

For each object class/concept a new detector needs to be learned
Remedy: instead of training specific detectors, exploit object similarity to translate images to words directly (examples at the end of this lecture)

Limited query vocabulary
Remedy: link the query vocabulary to a lexical database (ontologies)
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 12
Ontology in Philosophy

Philosophical discipline that deals with the nature and the organization of reality
Also known as Aristotle’s Metaphysics
Tries to answer the questions
What is being?
What are the features common to all beings?

Representation of entities and events along with their properties and relations
Similar to ontology in computer science
Ontology in Computer Science

“Formal, explicit specification of a shared conceptualisation” (Tom Gruber, 1993)

Formal representation of concepts within a domain and the relationships between the concepts
Why use ontologies?

Labeling
If one says “cat” and the other “feline”, how is the system to know that both are the same?

Semantics
How should the system know that “lions”, “tigers” and “house cats” are all cats?

Knowledge sharing and reuse
Need to be able to create definitions of terms in a machine-understandable format
Ontology components

Concepts
Cat, dog
Relationships
Is a, part of
Properties
Length, age
Axioms
Cats cannot eat only vegetation
Constraints
Maximum value
Individuals
Garfield as an instance of cat
Example ontology

(Figures: a simple ontology with “is a” relationships, and its instances; a code stand-in follows below.)
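The diagram itself is not reproduced here; as a stand-in, a minimal sketch of such an “is a” hierarchy with instances (the concept names follow the component slide above, everything else is illustrative):

```python
# Simple "is a" ontology: each concept points to its parent concept.
is_a = {"cat": "mammal", "dog": "mammal", "mammal": "animal"}

# Instances: individuals assigned to a concept.
instance_of = {"Garfield": "cat"}

def is_kind_of(concept, ancestor):
    """Follow the 'is a' chain upwards to test concept subsumption."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

print(is_kind_of(instance_of["Garfield"], "animal"))  # True
```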
Ontology: WordNet

Lexical database for the English language
http://wordnet.princeton.edu
150K words organized in >115K synsets (groups of semantically equivalent elements)
Most synsets connected via semantic relations
Nouns and verbs organized into hierarchies

Can be interpreted as an ontology
Ontology: WordNet

Directed “is a” relationships:
X is a hypernym of Y if every Y is a kind of X
Cat is a hypernym of lion
X is a hyponym of Y if every X is a kind of Y
Lion is a hyponym of cat

Directed “part of” relationships:
X is a holonym of Y if Y is part of X
Cat is a holonym of fur
X is a meronym of Y if X is part of Y
Fur is a meronym of cat
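These relations can be explored directly, for example with NLTK’s WordNet interface (a minimal sketch; assumes the WordNet corpus has been downloaded via nltk.download):

```python
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")

print(cat.hypernyms())      # more general concepts ("is a" parents, e.g. feline)
print(cat.hyponyms())       # more specific kinds of cat
print(cat.part_meronyms())  # parts of a cat, if any are listed
```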
Ontology

One way to bridge the semantic gap is to link general-purpose ontologies (e.g. WordNet) to detectors

(Figure: example of such a linking.)
Object ontology

Another way to use an ontology to bridge the semantic gap:

Derive semantics from daily language:
“sky” can be described as an “upper, uniform and blue region”

Describe the query and the image objects in the knowledge base with color, position, size and shape representations from the object ontology (a toy sketch follows below)
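A toy sketch of this idea: describe each segmented region with qualitative attributes and match them against ontology entries such as “sky = upper, uniform, blue”; the attribute thresholds and the second ontology entry are made up for illustration.

```python
# Hypothetical qualitative object ontology derived from daily language.
ontology = {
    "sky":  {"position": "upper", "color": "blue", "texture": "uniform"},
    "sand": {"position": "lower", "color": "light-brown", "texture": "uniform"},
}

def describe_region(center_y, mean_hue, variance):
    """Map low-level region measurements to qualitative attributes."""
    return {
        "position": "upper" if center_y < 0.33 else
                    "lower" if center_y > 0.66 else "middle",
        "color": "blue" if 0.5 < mean_hue < 0.7 else "light-brown",
        "texture": "uniform" if variance < 0.01 else "textured",
    }

def label_region(region):
    """Return the first ontology concept whose attributes all match."""
    for concept, attrs in ontology.items():
        if all(region.get(k) == v for k, v in attrs.items()):
            return concept
    return None

print(label_region(describe_region(center_y=0.1, mean_hue=0.6, variance=0.005)))  # sky
```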
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 23
Relevance feedback

Bring the user into the retrieval loop to reduce the semantic gap between what queries represent and what the user thinks:
The system provides initial retrieval results for the query
The user judges the relevance of the results
A machine learning algorithm is applied to the user’s feedback to refine the search

See next lecture for more ☺
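The details belong to the next lecture, but as one classic example of the refinement step, a Rocchio-style update moves the query feature vector towards results the user marked relevant and away from the rest; a minimal sketch (the weights are conventional defaults, not taken from this lecture):

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Shift the query vector towards relevant and away from non-relevant results."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Toy example with 3-dimensional feature vectors.
q0 = np.array([0.2, 0.5, 0.3])
rel = np.array([[0.1, 0.7, 0.2]])
nonrel = np.array([[0.9, 0.0, 0.1]])
print(rocchio(q0, rel, nonrel))
```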
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 25
Web-based image retrieval

Popular web-search engines offer an image search function
The search is only based on textual evidence
Unable to confirm whether the retrieved images really contain the desired concepts

Basic idea:
Use a text-based classifier to detect concepts within the text
Use a visual-based classifier to detect concepts within the images
Fuse both classification results (a sketch follows below)
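A minimal sketch of the fusion step, assuming we already have one confidence score per image from each classifier (both scores are hypothetical numbers) and combine them linearly:

```python
def fused_score(text_score, visual_score, w_text=0.5):
    """Late fusion: linear combination of both classifiers' confidences."""
    return w_text * text_score + (1 - w_text) * visual_score

# Toy example: (text, visual) scores for the concept "beach" on three images.
candidates = {"img1": (0.9, 0.2), "img2": (0.8, 0.7), "img3": (0.1, 0.9)}
ranked = sorted(candidates, key=lambda k: fused_score(*candidates[k]), reverse=True)
print(ranked)  # img2 first, since text and visual evidence agree there
```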
Conclusion

ML alone only allows a limited query vocabulary
It is possible to lower the limitations significantly by using an ontology
None of the named techniques can fully bridge the semantic gap
Modern CBIR systems usually consist of a combination of some of the introduced techniques
Example Systems

Automatic image annotation
Goal: given an un-annotated image, automatically assign meaningful keywords
Approaches:
Co-occurrence model (Mori et al. 1999)
Translation model (Duygulu et al. 2002)
Cross-media relevance model (Jeon et al. 2003)

Link lexical ontology to high-level detectors
MediaMill – Semantic Video Search Engine (Snoek et al. 2007)
Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords (e.g. Corel stock photo CDs)
2. Divide each image into regions and extract features from them (uniform grid, Blobworld, Normalized cuts)
3. Cluster the image regions; the cluster centers are called “blobs”
4. Approximate a PDF to model the relationship between images (blobs) and words (co-occurrence model, translation model, cross-media relevance model)

A sketch of steps 2 and 3 is shown below.
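As a rough illustration of steps 2 and 3, a minimal sketch assuming scikit-learn; the uniform-grid feature extraction and the random stand-in images are hypothetical placeholders for real segmented photos.

```python
import numpy as np
from sklearn.cluster import KMeans

def region_features(image, grid=4):
    """Step 2 stand-in: uniform grid, mean color per cell as the region feature."""
    h, w, _ = image.shape
    cells = [image[i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid]
             for i in range(grid) for j in range(grid)]
    return np.array([c.reshape(-1, 3).mean(axis=0) for c in cells])

# Hypothetical training images: random arrays standing in for real photos.
rng = np.random.default_rng(0)
all_regions = np.vstack([region_features(rng.integers(0, 256, (64, 64, 3)))
                         for _ in range(10)])

# Step 3: cluster all region features; the cluster centers are the blobs.
kmeans = KMeans(n_clusters=8, n_init=10).fit(all_regions)

# Later, a new region is described by the index of its nearest blob:
print(kmeans.predict(region_features(rng.integers(0, 256, (64, 64, 3)))))
```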
Automatic image annotation

For each unknown image:
1. Divide the image into regions
2. Describe each region with the index of the nearest blob
3. Use the learned PDF to translate blobs to words
Co-occurrence model

Idea:
One image has the keywords “sky” and “beach”
Another image has the keywords “sky” and “mountain”
Each region inherits all words from its image
Regions associated to the same blob share all their words, accumulating the information from both images
Now the sky regions have two words “sky”, one word “beach” and one word “mountain”

“Image-to-word transformation based on dividing and vector quantizing images with words” (Mori et al. 1999)
Co-occurrence model
1. Use many images with key words for learningy g y g2. Divide each image into parts, extract features3. Each part inherits all words from its imagep g4. Find clusters from all divided images5. Accumulate the frequencies of words of all parts in5. Accumulate the frequencies of words of all parts in
each cluster, and calculate the likelihood for every word
6. For an unknown image, divide it into parts, extract features, find nearest clusters for all parts. Combine th lik lih d f th i l t d d t i i hthe likelihoods of their clusters and determine wichwords are most plausible
Content-based image and video retrieval 35
Co-occurrence model

Building the model: approximate $P(w_j \mid b_i)$ for each word $w_j$ and each blob $b_i$ via the Bayes formula:

$$P(w_j \mid b_i) = \frac{P(b_i \mid w_j)\, P(w_j)}{P(b_i)}$$

with the estimates (a reconstruction consistent with the slide’s fraction labels)

$$P(b_i \mid w_j) \approx \frac{n(w_j, b_i)}{n(w_j)}, \qquad P(w_j) \approx \frac{n(w_j)}{N}$$

where $n(w_j, b_i)$ is the total count of word $w_j$ in blob $b_i$, $n(w_j)$ the total count of word $w_j$ in all data, and $N$ the total count of all words in all data.
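A minimal sketch of this estimation on the toy data from the idea slide (sky/beach and sky/mountain), assuming the blob indices per image are already known:

```python
from collections import Counter

# Per image: the blob indices of its regions and the keywords of the whole
# image (each region inherits all words from its image).
images = [
    {"blobs": [0, 1], "words": ["sky", "beach"]},
    {"blobs": [0, 2], "words": ["sky", "mountain"]},
]

word_in_blob = Counter()   # n(w_j, b_i): count of word w_j in blob b_i
words_in_blob = Counter()  # total word count per blob

for img in images:
    for b in img["blobs"]:
        for w in img["words"]:
            word_in_blob[(w, b)] += 1
            words_in_blob[b] += 1

def p_word_given_blob(w, b):
    """Estimated likelihood P(w_j | b_i) from the co-occurrence counts."""
    return word_in_blob[(w, b)] / words_in_blob[b]

print(p_word_given_blob("sky", 0))  # 0.5: blob 0 saw "sky" twice out of 4 words
```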
Co-occurrence model

Annotating an unknown image:
Average the likelihoods of all blobs in the image
Words with the largest average likelihood value are estimated to be the image labels
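Written out (a reconstruction from the slide text, since the original formula image is not reproduced): for an unknown image $I$ with blobs $b_1, \ldots, b_m$,

$$P(w_j \mid I) \approx \frac{1}{m} \sum_{i=1}^{m} P(w_j \mid b_i)$$

and the words $w_j$ with the largest value are chosen as labels.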
Co-occurrence model: some results

(Figure: top three results; bold words match the manual annotations.)
Translation model

View image annotation as a task of translating from a vocabulary of blobs (L1) to a vocabulary of words (L2)
Aligned “texts” in both languages (blobs, words) are available (“aligned bitexts”)
Here: the words associated with whole images are known
Learning a lexicon for L1 -> L2 from aligned bitexts is a standard problem in machine translation
Goal: determine precise correspondences between the words in both languages
Here: which word goes with which image region?

“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary” (Duygulu et al. 2002)
Correspondences between blobs and words

Associated keywords and segments are available
Precise correspondences between words and segments are missing
Goal: find the correspondences between segments and words
Translation model

Building the model:
Use EM to estimate the parameters of the likelihood t(w|b)
E step: use an estimate of t(w|b) to predict the word-to-blob correspondences
M step: use the correspondences to refine the estimate of t(w|b)
Initialization: random values
Translation model

Building the model:
Use EM to estimate the parameters of the likelihood; we need to find
P(a_nj = i): probability that in an image n, a blob b_i is associated with word w_j
t(w|b): probability of obtaining word w given blob b
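A compact sketch of this EM loop in the style of IBM Model 1 (a simplification of the model used in the paper, with hypothetical toy data; initialization is uniform rather than random for brevity):

```python
from collections import defaultdict

# Hypothetical aligned "bitexts": per image, its blobs and its keywords.
images = [(["b0", "b1"], ["sky", "beach"]),
          (["b0", "b2"], ["sky", "mountain"])]

blobs = {b for bs, _ in images for b in bs}
words = {w for _, ws in images for w in ws}

# Initial translation table t(w | b).
t = {(w, b): 1.0 / len(words) for w in words for b in blobs}

for _ in range(20):
    count = defaultdict(float)  # expected counts c(w, b)
    total = defaultdict(float)  # expected counts per blob
    # E step: distribute each word over the blobs of its image
    # according to the current estimate of t(w | b).
    for bs, ws in images:
        for w in ws:
            norm = sum(t[(w, b)] for b in bs)
            for b in bs:
                p = t[(w, b)] / norm
                count[(w, b)] += p
                total[b] += p
    # M step: re-estimate t(w | b) from the expected counts.
    for (w, b) in t:
        t[(w, b)] = count[(w, b)] / total[b] if total[b] else 0.0

print(max(words, key=lambda w: t[(w, "b0")]))  # most plausible word for blob b0
```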
Translation model

Post-processing:
Cluster visually indistinguishable words
e.g. cat – tiger, eagle – jet
Assign a null word to all blobs whose probability to be associated with a word is too small
Threshold value learned on a validation set
Fit a new lexicon with a reduced vocabulary after thresholding, since some words may never be predicted with sufficient probability
Translation model

(Figure: some good results, with callouts for null words and clustering.)
Cross-Media Relevance model

Relevance model = underlying PDF P(·|I) for each image
Can be thought of as an urn containing all possible blobs and keywords that could appear in I
Annotate an image by sampling words from P(·|I)
The probability of observing any word w_i when sampling from P(·|I) needs to be known (i.e. P(w_i|I))

“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models” (Jeon et al. 2003)
Cross-Media Relevance model

Approximate P(w_i|I) by the probability of observing w_i given a previously observed set of blobs b_1, …, b_m
Use of the maximum-likelihood estimator is not possible, since no annotation of the blobs is available in the images
Instead, use the training set of images to estimate the joint probability of w_i and the blob set, and then marginalize the PDF with respect to w_i
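In the paper this joint probability is estimated as an expectation over the images $J$ in the training set $T$ (a reconstruction of the formula that is not reproduced on the slide):

$$P(w, b_1, \ldots, b_m) = \sum_{J \in T} P(J)\, P(w \mid J) \prod_{i=1}^{m} P(b_i \mid J)$$

from which the annotation probability follows as $P(w \mid b_1, \ldots, b_m) = P(w, b_1, \ldots, b_m) / P(b_1, \ldots, b_m)$.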
Cross-Media Relevance model

Difference to the co-occurrence and translation models:
Instead of learning one-to-one correspondences, CMRM learns a joint PDF to model the correspondence between a set of words and a set of regions
Takes context on the blob level into account
e.g. tiger+grass is more likely than tiger+beach
Cross-Media Relevance model

Three different models possible:
1. Probabilistic CMRM (PACMRM): return all word probabilities; good for ranked retrieval
2. Fixed annotation based CMRM (FACMRM): report only the N best words; good for people to look at
3. Direct-retrieval CMRM (DRCMRM): use the model to translate words to blobs and look for similar regions in each test image
Cross-Media Relevance model

(Figure: performance of the two ranked retrieval models.)
Cross-Media Relevance model

(Figure: automatic annotations (best four words) compared to manual annotations.)
Automatic annotation - Results

(Figures: per-word precision and recall for the three models.)

Mean precision and recall over all words:

Model                  Mean Precision   Mean Recall
Co-occurrence model    0.07             0.11
Translation model      0.14             0.24
CMRM                   0.33             0.37
Automatic annotation – Conclusion

Easier to train than dedicated concept detectors (no tuning needed)
Allows a fine degree of semantic granularity

Results may contain noise or may not represent the user’s intention:
Use relevance feedback
Only words from the training vocabulary are allowed as a query:
Use an ontology
Adding Semantics to Detectors for Video Retrieval

Snoek et al., IEEE Transactions on Multimedia, 2007
MediaMill

The system consists of 101 machine-learned high-level feature (= concept) detectors
Key-frame based video classification

Goal: allow the user to use natural language queries instead of restricting him to the limited vocabulary of learned concepts

Idea: link WordNet to the concept-detector set
Semantically enriched detectors

Each detector is associated with:

A manually created textual description
Storms – “outdoor scenes with stormy weather, thunderstorms, lightning”

Links to WordNet
Manually map the description to WordNet synset descriptions (synset = group of semantically equivalent elements)

A visual model
Estimates a confidence to indicate whether the concept is present in a shot
MediaMill

Detector selection strategies:

Text matching:
Select the detector with the highest similarity between the query and the detector description

Ontology querying:
Translate the query to ontological concepts
Query WordNet to determine which detector is most related to the original query (see the sketch below)
A concept like “vehicle” is also defined by occurrences of its sub-concepts like “car”, “truck”, etc.

Semantic visual querying:
Use all concept detectors to classify the concept of the query images and select the most likely detector
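As a rough sketch of the ontology-querying strategy, one could score each detector by the WordNet similarity between the query synset and the detector’s linked synset; this assumes NLTK, uses a simple path similarity rather than MediaMill’s actual measure, and the detector set is hypothetical.

```python
from nltk.corpus import wordnet as wn

# Hypothetical detector set, each linked to a WordNet synset.
detectors = {
    "car": wn.synset("car.n.01"),
    "boat": wn.synset("boat.n.01"),
    "aircraft": wn.synset("aircraft.n.01"),
}

def select_detector(query):
    """Pick the detector whose synset is most related to the query synset."""
    q = wn.synsets(query)[0]
    return max(detectors, key=lambda d: q.path_similarity(detectors[d]) or 0)

print(select_detector("vehicle"))  # selects one of vehicle's sub-concepts
```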
Results

Average precision per selection strategy:

Text matching:       50.8 %
Ontology querying:   56.0 %
Visual querying:     55.6 %

Average precision after fusing selection strategies (linear combination of the results):

Text + Ontology:     65.5 %
Text + Visual:       72.4 %
Ontology + Visual:   75.9 %
All:                 83.4 %
Lessons learned

Bridging the semantic gap is a major problem of CBIR
A plethora of approaches exists to deal with this problem
Automatic annotation algorithms allow a fine degree of semantic granularity
Linking detectors to lexical databases allows natural language querying
References

“Image-to-word transformation based on dividing and vector quantizing images with words”, Mori et al., MISRM 1999
“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary”, Duygulu et al., ECCV 2002
“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models”, Jeon et al., ACM SIGIR 2003
“Adding Semantics to Detectors for Video Retrieval”, Snoek et al., IEEE Transactions on Multimedia, 2007