
Content-based image and video analysis

Image and Video Semantics

14.06.2010


Levels of image/video retrieval

Level 1: Retrieval based on color, texture, and shape features. Images are compared based on low-level features; no semantics is involved. A lot of research has been done; this is a feasible task.

Level 2: Bring semantic meaning into the search, e.g. identifying human beings, horses, trees, beaches. Requires retrieval techniques of level 1.

Level 3: Retrieval with abstract and subjective attributes, e.g. find pictures of a particular birthday celebration, or find a picture of a happy, beautiful woman. Requires retrieval techniques of level 2 and very complex logic.


Semantic gap

“Lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation” (Smeulders et al., 2000)

“Semantic gap” = the gap between level 1 and level 2

This lecture: overview of techniques to “bridge” this gap


What is semantics?

Semantics = meaning. BUT: what is the semantics of this picture?


What is semantics?

What is the semantics of this picture?

“Upper-right uniform light-blue region, upper-left black region, middle dark-blue region, and uniform lower light-brown region”?
Sea, sky, sand, mountains, waves, boats?
Beach with mountains in the background?
Sunset at the beach?
Rio de Janeiro, Sugarloaf?
…


Challenges

Semantics has many degrees of granularity

Semantics is in the eye of the beholder

There are different ways to express the same thing

When attempting to bridge the semantic gap, CBIR systems should pay attention to all of these points


Ways to bridge the semantic gap

1. Machine learning (supervised / unsupervised)

2. Ontology based
3. Relevance feedback
4. Web image retrieval

Usually a combination of those approaches


Ways to bridge the semantic gap

1. Machine learning

2. Ontology based

3. Relevance feedback

4. Web image retrieval


Machine learning

Basic steps:
Convert the image to low-level features (color/edge histograms, wavelets, DCT, etc.)
Use a machine learning algorithm to map the features to image/video semantics (SVM, ANN, CART, etc.)
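A minimal sketch of these two steps, assuming scikit-learn and a simple color-histogram feature; the images, labels, and histogram binning are illustrative only, not a specific system from the literature.

```python
import numpy as np
from sklearn.svm import SVC

def color_histogram(image, bins=8):
    """Low-level feature: joint RGB histogram of an HxWx3 uint8 image, flattened."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 256)] * 3)
    hist = hist.flatten()
    return hist / hist.sum()  # normalize so image size does not matter

# Hypothetical training data: images with known semantic labels (level 2 concepts).
train_images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
train_labels = ["beach"] * 10 + ["forest"] * 10

X = np.array([color_histogram(im) for im in train_images])
clf = SVC().fit(X, train_labels)  # map low-level features to semantics

# Annotate a new, unseen image.
query = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(clf.predict([color_histogram(query)]))
```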

Examples:
Genre classification
High-level feature detection
Person identification
Event recognition


Machine learning

Supervised learning
The semantic concept of the training data samples is known in advance

Unsupervised learning
No prior knowledge available
E.g. clustering of the data based on some similarity measure
Semantics can then be assigned manually to each cluster
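A minimal sketch of the unsupervised route, assuming scikit-learn's k-means; the feature vectors and the manually assigned cluster names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Low-level feature vectors of unlabeled images (hypothetical values).
features = np.random.rand(100, 64)

# 1. Cluster the images by feature similarity -- no semantics involved yet.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)

# 2. A human inspects a few images per cluster and names each cluster once.
cluster_semantics = {0: "beach", 1: "city", 2: "forest", 3: "portrait", 4: "indoor"}

# 3. Every image inherits the semantics of its cluster.
image_labels = [cluster_semantics[c] for c in kmeans.labels_]
print(image_labels[:5])
```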


Machine learning

Problems:
Lots of (annotated) training data needed

Bootstrapping
Human computation (ESP game, Google Image Labeler -> see the “Tools” lecture)

For each object-class/concept a new detector needs to be learned

Instead of training specific detectors, exploit object similarity to translate images to words directly (examples at the end of this lecture)

Limited query vocabulary
Link the query vocabulary to a lexical database (ontologies)


Ways to bridge the semantic gap

1. Machine learning

2. Ontology based

3. Relevance feedback

4. Web image retrieval


Ontology in Philosophy

Philosophical discipline that deals with the nature and the organization of reality
Also known as Aristotle’s Metaphysics
Tries to answer the questions:

What is being?
What are the features common to all beings?

Representation of entities and events along with their properties and relations

Similar to ontology in computer science


Ontology in Computer Science

“Formal, explicit specification of a shared conceptualisation” (Tom Gruber, 1993)

Formal representation of concepts within a domain and the relationships between the concepts


Why use ontologies?

Labeling
If one says “cat” and the other “feline”, how is the system to know that both are the same?

Semantics
How should the system know that “lions”, “tigers” and “house cats” are all cats?

Knowledge sharing and reuse
Need to be able to create definitions of terms in a machine-understandable format


Ontology components

Concepts: cat, dog
Relationships: is a, part of
Properties: length, age
Axioms: cats cannot eat only vegetation
Constraints: maximum value
Individuals: Garfield as an instance of cat


Example ontology

Simple ontology with “is a” relationships:

Instances:
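Since the original figures are not reproduced here, a minimal sketch of such an ontology in plain Python; the concepts and the instance follow the cat example used above and are illustrative only.

```python
# "is a" hierarchy: child concept -> parent concept
is_a = {
    "cat": "animal",
    "dog": "animal",
    "lion": "cat",
    "house cat": "cat",
}

# Instances: individual -> the concept it instantiates
instances = {"Garfield": "house cat"}

def ancestors(concept, hierarchy=is_a):
    """All concepts that `concept` is a kind of (transitive "is a")."""
    chain = []
    while concept in hierarchy:
        concept = hierarchy[concept]
        chain.append(concept)
    return chain

# Garfield is a house cat, hence also a cat and an animal.
print(ancestors(instances["Garfield"]))  # ['cat', 'animal']
```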


Ontology: WordNet

Lexical database for the English language
http://wordnet.princeton.edu
150K words organized in >115K synsets (groups of semantically equivalent elements)
Most synsets are connected via semantic relations
Nouns and verbs are organized into hierarchies

Can be interpreted as an ontology


Ontology: WordNet

Directed “is a” relationships:
X is a hypernym of Y if every Y is a kind of X

Cat is a hypernym of lion

X is a hyponym of Y if every X is a kind of Y
Lion is a hyponym of cat

Directed “part of” relationships:
X is a holonym of Y if Y is part of X

Cat is a holonym of fur

X is a meronym of Y if X is part of Y
Fur is a meronym of cat
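These relations can also be queried programmatically; a minimal sketch assuming NLTK's WordNet interface (the chosen synsets are illustrative, and some relation lists may be empty depending on the synset and WordNet version).

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

cat = wn.synset('cat.n.01')    # one noun sense of "cat"
lion = wn.synset('lion.n.01')
fur = wn.synset('fur.n.01')

# Directed "is a" relations
print(lion.hypernyms())        # what a lion is a kind of
print(cat.hyponyms())          # more specific kinds of this sense of "cat"

# Directed "part of" relations
print(fur.part_holonyms())     # things that fur is a part of (may be empty)
print(cat.part_meronyms())     # parts of a cat (may be empty)
```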


Ontology

One way to bridge the semantic gap is to link general-purpose ontologies (e.g. WordNet) to detectors



Object ontology

Another way to use an ontology to bridge the semantic gap:

Derive semantics from daily language: “sky” can be described as an “upper, uniform and blue region”

Describe the query and the image objects in the knowledge base with color, position, size and shape representations from the object ontology
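A minimal sketch of this idea, assuming regions are already segmented and described by a few low-level attributes; the attribute names, thresholds, and the “sky” rule are illustrative, not an actual published object ontology.

```python
def qualitative_description(region):
    """Map low-level region attributes to object-ontology terms.
    `region` has a normalized vertical center in [0, 1], a mean hue in degrees,
    and a color-variance measure (hypothetical attribute names)."""
    vertical = ("upper" if region["center_y"] < 0.33 else
                "middle" if region["center_y"] < 0.66 else "lower")
    uniformity = "uniform" if region["color_variance"] < 0.05 else "textured"
    color = "blue" if 180 <= region["mean_hue"] <= 260 else "other"
    return {"vertical": vertical, "uniformity": uniformity, "color": color}

# "sky" defined in everyday terms: an upper, uniform, blue region.
def matches_sky(region):
    return qualitative_description(region) == {
        "vertical": "upper", "uniformity": "uniform", "color": "blue"}

region = {"center_y": 0.1, "mean_hue": 210, "color_variance": 0.01}
print(matches_sky(region))  # True
```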



Ways to bridge the semantic gap

1. Machine learning

2. Ontology based

3. Relevance feedback

4. Web image retrieval


Relevance feedback

Bring the user into the retrieval loop to reduce the semantic gap between what queries represent and what the user thinks:

The system provides initial retrieval results for the query
The user judges the relevance of the results
A machine learning algorithm is applied to the user’s feedback to refine the search
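A minimal sketch of one classic instantiation of this loop (Rocchio-style query-point movement in feature space); it is an illustration, not the specific algorithm of any system discussed here.

```python
import numpy as np

def refine_query(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query feature vector towards images marked relevant and away
    from images marked non-relevant (Rocchio-style update)."""
    q = alpha * query_vec
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q

def retrieve(query_vec, database, k=5):
    """Rank database feature vectors by Euclidean distance to the query."""
    dists = np.linalg.norm(database - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Hypothetical feature database and simulated user judgements.
database = np.random.rand(1000, 64)
query = np.random.rand(64)
for _ in range(3):
    top = retrieve(query, database)
    relevant = database[top[:2]]      # pretend the user accepted the first two
    nonrelevant = database[top[2:]]   # and rejected the rest
    query = refine_query(query, relevant, nonrelevant)
```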

See next lecture for more ☺


Ways to bridge the semantic gap

1. Machine learning

2. Ontology based

3. Relevance feedback

4. Web image retrieval


Web-based image retrieval

Popular web search engines offer an image search function

The search is based only on textual evidence
Unable to confirm whether the retrieved images really contain the desired concepts

Basic idea:
Use a text-based classifier to detect concepts within the text
Use a visual classifier to detect concepts within the images
Fuse both classification results (a sketch follows below)
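A minimal sketch of such a fusion, assuming each classifier already returns a per-concept confidence score in [0, 1]; the scores and the equal weighting are illustrative.

```python
def fuse_scores(text_scores, visual_scores, w_text=0.5, w_visual=0.5):
    """Linearly combine per-concept confidences from a text-based and a
    visual-based classifier."""
    concepts = set(text_scores) | set(visual_scores)
    return {c: w_text * text_scores.get(c, 0.0) + w_visual * visual_scores.get(c, 0.0)
            for c in concepts}

# Hypothetical outputs for one web image and its surrounding page text:
text_scores = {"tiger": 0.9, "zoo": 0.7}        # concepts detected in the text
visual_scores = {"tiger": 0.6, "grass": 0.8}    # concepts detected in the image
fused = fuse_scores(text_scores, visual_scores)
print(max(fused, key=fused.get))  # 'tiger'
```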


Conclusion

ML alone only allows a limited query vocabulary
It is possible to reduce this limitation significantly by using an ontology
None of the named techniques can fully bridge the semantic gap on its own

Modern CBIR systems usually consist of a combination of some of the introduced techniques


Example Systems

Automatic image annotation
Goal: given an un-annotated image, automatically assign meaningful keywords
Approaches:

Co-occurrence model (Mori et al. 1999)
Translation model (Duygulu et al. 2002)
Cross-media relevance model (Jeon et al. 2003)

Link a lexical ontology to high-level detectors
MediaMill – Semantic Video Search Engine (Snoek et al. 2007)


Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords
e.g. Corel stock photo CDs


Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords
2. Divide each image into regions and extract features from them
Uniform grid
Blobworld
Normalized cuts


Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords
2. Divide each image into regions and extract features from them
3. Cluster the image regions
Cluster centers = blobs


Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords
2. Divide each image into regions and extract features from them
3. Cluster the image regions
4. Approximate a PDF to model the relationship between images (blobs) and words
Co-occurrence model
Translation model
Cross-media relevance model


Automatic image annotation

For each unknown image:
1. Divide the image into regions

2. Describe each region with index of nearest blob

3. Use learned PDF to translate blobs to words
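A minimal end-to-end sketch of these three steps, assuming region features are already extracted and a per-blob word distribution has been learned (the averaging used here follows the co-occurrence model described below; all data structures are illustrative).

```python
import numpy as np

def nearest_blob(region_feature, blob_centers):
    """Step 2: describe a region by the index of its nearest cluster center (blob)."""
    return int(np.argmin(np.linalg.norm(blob_centers - region_feature, axis=1)))

def annotate(region_features, blob_centers, word_given_blob, n_words=3):
    """Step 3: average the learned P(word | blob) over all blobs of the image
    and return the most likely words."""
    blobs = [nearest_blob(f, blob_centers) for f in region_features]
    candidates = {w for b in blobs for w in word_given_blob[b]}
    avg = {w: np.mean([word_given_blob[b].get(w, 0.0) for b in blobs]) for w in candidates}
    return sorted(avg, key=avg.get, reverse=True)[:n_words]

# Hypothetical learned model: 3 blobs, each with a word distribution.
blob_centers = np.random.rand(3, 16)
word_given_blob = {0: {"sky": 0.6, "sea": 0.3},
                   1: {"sand": 0.5, "beach": 0.4},
                   2: {"sky": 0.2, "mountain": 0.7}}

regions = np.random.rand(4, 16)  # step 1: regions of the unknown image
print(annotate(regions, blob_centers, word_given_blob))
```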


Co-occurrence model

Idea:
One image has the keywords “sky” and “beach”
Another image has the keywords “sky” and “mountain”
Each region inherits all words from its image
Accumulate the information from both images

Regions associated with the same blob share all their words

Now, the sky regions have two words “sky”, one word “beach” and one word “mountain”

“Image-to-word transformation based on dividing and vector quantizing images with words” (Mori et al. 1999)


Co-occurrence model

1. Use many images with keywords for learning
2. Divide each image into parts and extract features
3. Each part inherits all words from its image
4. Find clusters from all divided images
5. Accumulate the frequencies of words of all parts in each cluster, and calculate the likelihood for every word
6. For an unknown image, divide it into parts, extract features, and find the nearest cluster for each part. Combine the likelihoods of their clusters and determine which words are most plausible (see the sketch below)
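A minimal sketch of the learning side (steps 4 and 5), assuming each training region comes with the keyword list inherited from its image; it pairs with the annotation sketch shown earlier, and the feature dimensions are illustrative.

```python
from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

def train_cooccurrence(region_features, region_words, n_blobs=50):
    """Cluster all training regions into blobs (step 4) and accumulate, per blob,
    how often each inherited word occurs (step 5).
    region_features: (N, D) array, one row per region from all training images.
    region_words: list of N keyword lists, each region inheriting its image's words."""
    kmeans = KMeans(n_clusters=n_blobs, n_init=10, random_state=0).fit(region_features)
    counts = defaultdict(Counter)
    for blob, words in zip(kmeans.labels_, region_words):
        counts[blob].update(words)
    # Likelihood of each word given a blob: relative frequency within the blob.
    word_given_blob = {b: {w: c / sum(ctr.values()) for w, c in ctr.items()}
                       for b, ctr in counts.items()}
    return kmeans, word_given_blob
```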


Co-occurrence model

Building the model:
Approximate P(w|b) for each word w and each blob b using Bayes’ formula:
P(w|b) = P(b|w) · P(w) / P(b)


Co-occurrence model

Building the model:
Approximate P(w|b) for each word w and each blob b:
P(w|b) ∝ P(b|w) · P(w), with
P(b|w) ≈ (total of word w in blob b) / (total of word w in all data)
P(w) ≈ (total of word w in all data) / (total of all words in all data)


Co-occurrence model

Annotating an unknown image:
Average the likelihoods of all blobs in the image
The words with the largest average likelihood values are estimated to be the image labels


Co-occurrence model: some results


Top three results. Bold words match manual annotations.


Translation model

View image annotation as a task of translating from a vocabulary of blobs (L1) to a vocabulary of words (L2)

Aligned “texts” in both languages (blobs, words) are available (“aligned bitexts”). Here: the words associated with whole images are known

Learning a lexicon for L1 -> L2 from aligned bitexts is a standard problem in machine translation

Goal: determine precise correspondences between the words in both languages. Here: which word goes with which image region?

“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary” (Duygulu et al. 2002)


Correspondences between blobs and words

Associated keywords and segments are available
Precise correspondences between words and segments are missing

Goal: find the correspondences between segments and words


Translation model

Building the model:
Use EM to estimate the parameters of the likelihood
E step: use the current estimate of t(w|b) to predict the word-to-blob correspondences
M step: use the correspondences to refine the estimate of t(w|b)
Initialization: random values


Translation model

Building the model:
Use EM to estimate the parameters of the likelihood. Need to find:
p(anj = i): the probability that, in image n, blob bi is associated with word wj
t(w|b): the probability of obtaining word w given blob b
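A minimal sketch of this EM procedure in the style of IBM Model 1, assuming each training image is given as a pair (blob indices, inherited words); it is a simplification of the model of Duygulu et al. (2002), not their exact formulation.

```python
from collections import defaultdict

def train_translation_model(corpus, n_iter=10):
    """EM estimation of t(w|b) from aligned 'bitexts'.
    corpus: list of (blobs, words) pairs, both lists of discrete symbols."""
    vocab_w = {w for _, ws in corpus for w in ws}
    t = defaultdict(lambda: 1.0 / len(vocab_w))  # t[(w, b)], uniform initialization

    for _ in range(n_iter):
        count = defaultdict(float)   # expected co-occurrence counts c(w, b)
        total = defaultdict(float)   # per-blob normalizer
        # E step: distribute each word over the blobs of its image
        for blobs, words in corpus:
            for w in words:
                z = sum(t[(w, b)] for b in blobs)
                for b in blobs:
                    p = t[(w, b)] / z
                    count[(w, b)] += p
                    total[b] += p
        # M step: re-estimate t(w|b) from the expected counts
        for (w, b), c in count.items():
            t[(w, b)] = c / total[b]
    return t

# Toy corpus: blob ids and inherited keywords per training image (hypothetical).
corpus = [([0, 1], ["sky", "sea"]), ([0, 2], ["sky", "sand"]), ([1, 2], ["sea", "sand"])]
t = train_translation_model(corpus)
print(max(["sky", "sea", "sand"], key=lambda w: t[(w, 0)]))  # most plausible word for blob 0
```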


Translation model

Post-processing:
Cluster visually indistinguishable words (e.g. cat – tiger, eagle – jet)
Assign a null word to all blobs whose probability of being associated with any word is too small
The threshold value is learned on a validation set
Fit a new lexicon with a reduced vocabulary after thresholding, since some words may never be predicted with sufficient probability


Translation model

Some good results (figure omitted): annotation examples illustrating the effect of null words and of word clustering


Cross-Media Relevance model

Relevance model = an underlying PDF P(·|I) for each image I

Can be thought of as an urn containing all possible blobs and keywords that could appear in I

Annotate an image by sampling words from P(·|I)

The probability of observing any word wi when sampling from P(·|I) needs to be known, i.e. P(wi|I)

“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models” (Jeon et al. 2003)


Cross-Media Relevance model

Approximate P(wi|I) by the probability of observing wi given the previously observed set of blobs b1, …, bm: P(wi|I) ≈ P(wi | b1, …, bm)

Use of a maximum-likelihood estimator is not possible, since no annotation of the blobs is available in the images
Instead, use a training set of images to estimate the joint probability P(wi, b1, …, bm) and then marginalize the PDF with respect to wi
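A minimal sketch of this estimation, following the general form of the cross-media relevance model: the joint probability is a sum over training images J of P(J)·P(w|J)·∏ P(bi|J), with the per-image estimates smoothed by corpus-wide frequencies. The smoothing constants and data structures are illustrative; see Jeon et al. (2003) for the exact formulation.

```python
from collections import Counter

def cmrm_joint_probability(word, query_blobs, training_set, alpha=0.1, beta=0.1):
    """Estimate P(word, b_1..b_m) ~= sum_J P(J) * P(word|J) * prod_i P(b_i|J).
    training_set: list of (words, blobs) pairs, one per training image."""
    all_words = Counter(w for ws, _ in training_set for w in ws)
    all_blobs = Counter(b for _, bs in training_set for b in bs)
    n_words, n_blobs = sum(all_words.values()), sum(all_blobs.values())

    p_joint = 0.0
    for words, blobs in training_set:
        wc, bc = Counter(words), Counter(blobs)
        p_w = (1 - alpha) * wc[word] / len(words) + alpha * all_words[word] / n_words
        p_bs = 1.0
        for b in query_blobs:
            p_bs *= (1 - beta) * bc[b] / len(blobs) + beta * all_blobs[b] / n_blobs
        p_joint += (1.0 / len(training_set)) * p_w * p_bs  # uniform prior P(J)
    return p_joint  # normalize over all words to obtain P(word | b_1..b_m)

# Hypothetical training data: (keywords, blob ids) per image.
training_set = [(["tiger", "grass"], [3, 7]),
                (["tiger", "water"], [3, 9]),
                (["beach", "sky"], [1, 2])]
print(cmrm_joint_probability("tiger", [3, 7], training_set))
```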


Cross-Media Relevance model

Difference to the co-occurrence and translation models:

Instead of learning one-to-one correspondences, CMRM learns a joint PDF to model the correspondence between a set of words and a set of regions

This takes context at the blob level into account, e.g. tiger+grass is more likely than tiger+beach


Cross-Media Relevance model

Three different models are possible:
1. Probabilistic annotation-based CMRM (PACMRM): return all word probabilities; good for ranked retrieval
2. Fixed annotation-based CMRM (FACMRM): report only the N best words; good for people to look at
3. Direct-retrieval CMRM (DRCMRM): use the model to translate words to blobs and look for similar regions in each test image


Cross-Media Relevance model

Performance of the two ranked retrieval models:


Cross-Media Relevance model

Automatic annotations (best four words) compared to manual annotations:


Automatic annotation - Results

Precision:


Automatic annotation - Results

Recall:


Automatic annotation - Results

Mean precision and mean recall:

Co-occurrence model: 0.07 precision, 0.11 recall
Translation model: 0.14 precision, 0.24 recall
CMRM: 0.33 precision, 0.37 recall


Automatic annotation – Conclusion

Easier to train than dedicated concept detectors (no tuning needed)
Allows a fine degree of semantic granularity

Results may contain noise or may not represent the user’s intention -> use relevance feedback
Only words from the training vocabulary are allowed as queries -> use an ontology


Adding Semantics to Detectors for Video Retrieval

Snoek et al., IEEE Transactions on Multimedia 2007


MediaMill

The system consists of 101 machine-learned high-level feature (= concept) detectors
Key-frame-based video classification
Goal: allow the user to use natural-language queries instead of restricting him to a limited vocabulary (the learned concepts)

Idea: link WordNet to the concept-detector set


Semantically enriched detector

Each detector is associated with:
A manually created textual description
Storms – “outdoor scenes with stormy weather, thunderstorms, lightning”
Links to WordNet
Manually map the description to WordNet synset descriptions (synset = group of semantically equivalent elements)
A visual model
Estimates a confidence indicating whether the concept is present in a shot


MediaMill

Detector selection strategies:
Text matching: select the detector with the highest similarity between the query and the detector description
Ontology querying: translate the query to ontological concepts and query WordNet to determine which detector is most related to the original query; a concept like “vehicle” is also defined by occurrences of its sub-concepts like “car”, “truck”, etc. (a sketch of this strategy follows below)
Semantic visual querying: use all concept detectors to classify the concept of the query images and select the most likely detector
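A minimal sketch of the ontology-querying strategy, assuming NLTK's WordNet and its path-similarity measure as the relatedness score; the detector-to-synset mapping is illustrative, not MediaMill's actual linking.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Hypothetical mapping from concept detectors to WordNet synsets.
detector_synsets = {
    "vehicle": wn.synset('vehicle.n.01'),
    "animal": wn.synset('animal.n.01'),
    "sports": wn.synset('sport.n.01'),
}

def select_detector(query_word):
    """Pick the detector whose linked synset is most related to the query term."""
    query_synsets = wn.synsets(query_word, pos=wn.NOUN)
    if not query_synsets:
        return None
    best, best_score = None, -1.0
    for name, det_syn in detector_synsets.items():
        score = max((q.path_similarity(det_syn) or 0.0) for q in query_synsets)
        if score > best_score:
            best, best_score = name, score
    return best

# A query for "truck" should map to the "vehicle" detector, since truck is a
# sub-concept (hyponym) of vehicle in WordNet.
print(select_detector("truck"))
```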


Results

Average precision:
Text matching: 50.8 %
Ontology querying: 56.0 %
Visual querying: 55.6 %

Average precision after fusing selection strategies (linear combination of the results):
Text + Ontology: 65.5 %
Text + Visual: 72.4 %
Ontology + Visual: 75.9 %
All: 83.4 %


Lessons learned

Bridging the semantic gap is a major problem of CBIR

A plethora of approaches exist to deal with this problem

Automatic annotation algorithms allow a fine degree of semantic granularity

Linking detectors to lexical databases allows natural language querying


References

“Image-to-word transformation based on dividing and vector quantizing images with words”, Mori et al., MISRM 1999

“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary”, Duygulu et al., ECCV 2002

“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models”, Jeon et al., ACM SIGIR 2003

“Adding Semantics to Detectors for Video Retrieval”, Snoek et al., IEEE Transactions on Multimedia, 2007