Content-based image and video analysis

Image and Video Semantics

14.06.2010
Levels of image/video retrieval

Level 1: Based on color, texture, and shape features
Images are compared based on low-level features, no semantics involved
A lot of research done; a feasible task

Level 2: Bring semantic meanings into search
e.g. identifying human beings, horses, trees, beaches
Requires retrieval techniques of level 1

Level 3: Retrieval with abstract and subjective attributes
Find pictures of a particular birthday celebration
Find a picture of a happy, beautiful woman
Requires retrieval techniques of level 2 and very complex logic
Semantic gap

“Lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation” (Smeulders et al., 2000)

“Semantic gap” = gap between level 1 and level 2

This lecture: overview of techniques to “bridge” this gap
What is semantics?

Semantics = meaning
BUT: What is the semantics of this picture?
What is semantics?

What is the semantics of this picture?
“Upper-right uniform light-blue region and upper-left black region and middle dark-blue region and uniform lower light-brown region”?
Sea, sky, sand, mountains, waves, boats?
Beach with mountains in the background?
Sunset at the beach?
Rio de Janeiro, sugar loaf?
…
Challenges

Semantics has many degrees of granularity
Semantics is in the eye of the beholder
There are different ways to express the same thing

When attempting to bridge the semantic gap, CBIR systems should pay attention to all those points
Ways to bridge the semantic gap

1. Machine learning (supervised, unsupervised)
2. Ontology based
3. Relevance feedback
4. Web image retrieval
…

Usually a combination of those approaches is used
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 8
Machine learning

Basic steps:
Convert the image to low-level features (color/edge histograms, wavelets, DCT, etc.)
Use a machine learning algorithm to map the features to image/video semantics (SVM, ANN, CART, etc.)
A minimal sketch of these two steps is shown below.

Examples:
Genre classification
High-level feature detection
Person identification
Event recognition
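As a rough illustration, here is a minimal sketch of the two-step pipeline above, assuming scikit-learn and a simple RGB color histogram as the low-level feature; the file names and concept labels are hypothetical placeholders.

```python
import numpy as np
from PIL import Image
from sklearn.svm import SVC

def color_histogram(path, bins=8):
    """Step 1: low-level feature, a joint RGB color histogram."""
    img = np.asarray(Image.open(path).convert("RGB"))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()  # normalize so image size does not matter

# Hypothetical training data: image paths with known concept labels.
train_paths = ["beach1.jpg", "beach2.jpg", "city1.jpg", "city2.jpg"]
train_labels = ["beach", "beach", "city", "city"]

# Step 2: map the features to semantics with a supervised learner (here an SVM).
X = np.array([color_histogram(p) for p in train_paths])
clf = SVC().fit(X, train_labels)

print(clf.predict([color_histogram("unknown.jpg")]))
```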
Machine learning

Supervised learning
The semantic concept of the training data samples is known in advance

Unsupervised learning
No prior knowledge available
E.g. clustering of the data based on some similarity measure
Semantics can be assigned manually to each cluster
Machine learning

Problems:

Lots of (annotated) training data needed
Remedies: bootstrapping; human computation (ESP game, Google Image Labeler; see the “Tools” lecture)

For each object class/concept a new detector needs to be learned
Remedy: instead of training specific detectors, exploit object similarity to translate images to words directly (examples at the end of this lecture)

Limited query vocabulary
Remedy: link the query vocabulary to a lexical database (ontologies)
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 12
Ontology in Philosophy

Philosophical discipline that deals with the nature and the organization of reality
Also known as Aristotle’s Metaphysics
Tries to answer the questions
What is being?
What are the features common to all beings?

Representation of entities and events along with their properties and relations
Similar to ontology in computer science
Ontology in Computer Science

“Formal, explicit specification of a shared conceptualisation” (Tom Gruber, 1993)

Formal representation of concepts within a domain and the relationships between the concepts
Why use ontologies?

Labeling
If one says “cat” and the other “feline”, how is the system to know that both are the same?

Semantics
How should the system know that “lions”, “tigers” and “house cats” are all cats?

Knowledge sharing and reuse
Need to be able to create definitions of terms in a machine-understandable format
Ontology components

Concepts
Cat, dog
Relationships
Is a, part of
Properties
Length, age
Axioms
Cats cannot eat only vegetation
Constraints
Maximum value
Individuals
Garfield as an instance of cat
Example ontology

(Figures: a simple ontology with “is a” relationships, and its instances; a code stand-in follows below.)
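The diagram itself is not reproduced here; as a stand-in, a minimal sketch of such an “is a” hierarchy with instances (the concept names follow the component slide above, everything else is illustrative):

```python
# Simple "is a" ontology: each concept points to its parent concept.
is_a = {"cat": "mammal", "dog": "mammal", "mammal": "animal"}

# Instances: individuals assigned to a concept.
instance_of = {"Garfield": "cat"}

def is_kind_of(concept, ancestor):
    """Follow the 'is a' chain upwards to test concept subsumption."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = is_a.get(concept)
    return False

print(is_kind_of(instance_of["Garfield"], "animal"))  # True
```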
Ontology: WordNet

Lexical database for the English language
http://wordnet.princeton.edu
150K words organized in >115K synsets (groups of semantically equivalent elements)
Most synsets connected via semantic relations
Nouns and verbs organized into hierarchies

Can be interpreted as an ontology
Ontology: WordNet

Directed “is a” relationships:
X is a hypernym of Y if every Y is a kind of X
Cat is a hypernym of lion
X is a hyponym of Y if every X is a kind of Y
Lion is a hyponym of cat

Directed “part of” relationships:
X is a holonym of Y if Y is part of X
Cat is a holonym of fur
X is a meronym of Y if X is part of Y
Fur is a meronym of cat
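These relations can be explored directly, for example with NLTK’s WordNet interface (a minimal sketch; assumes the WordNet corpus has been downloaded via nltk.download):

```python
from nltk.corpus import wordnet as wn

cat = wn.synset("cat.n.01")

print(cat.hypernyms())      # more general concepts ("is a" parents, e.g. feline)
print(cat.hyponyms())       # more specific kinds of cat
print(cat.part_meronyms())  # parts of a cat, if any are listed
```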
Ontology

One way to bridge the semantic gap is to link general-purpose ontologies (e.g. WordNet) to detectors

(Figure: example of such a linking.)
Object ontology

Another way to use an ontology to bridge the semantic gap:

Derive semantics from daily language:
“sky” can be described as an “upper, uniform and blue region”

Describe the query and the image objects in the knowledge base with color, position, size and shape representations from the object ontology (a toy sketch follows below)
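A toy sketch of this idea: describe each segmented region with qualitative attributes and match them against ontology entries such as “sky = upper, uniform, blue”; the attribute thresholds and the second ontology entry are made up for illustration.

```python
# Hypothetical qualitative object ontology derived from daily language.
ontology = {
    "sky":  {"position": "upper", "color": "blue", "texture": "uniform"},
    "sand": {"position": "lower", "color": "light-brown", "texture": "uniform"},
}

def describe_region(center_y, mean_hue, variance):
    """Map low-level region measurements to qualitative attributes."""
    return {
        "position": "upper" if center_y < 0.33 else
                    "lower" if center_y > 0.66 else "middle",
        "color": "blue" if 0.5 < mean_hue < 0.7 else "light-brown",
        "texture": "uniform" if variance < 0.01 else "textured",
    }

def label_region(region):
    """Return the first ontology concept whose attributes all match."""
    for concept, attrs in ontology.items():
        if all(region.get(k) == v for k, v in attrs.items()):
            return concept
    return None

print(label_region(describe_region(center_y=0.1, mean_hue=0.6, variance=0.005)))  # sky
```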
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 23
Relevance feedback

Bring the user into the retrieval loop to reduce the semantic gap between what queries represent and what the user thinks:
The system provides initial retrieval results for the query
The user judges the relevance of the results
A machine learning algorithm is applied to the user’s feedback to refine the search

See next lecture for more ☺
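The details belong to the next lecture, but as one classic example of the refinement step, a Rocchio-style update moves the query feature vector towards results the user marked relevant and away from the rest; a minimal sketch (the weights are conventional defaults, not taken from this lecture):

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Shift the query vector towards relevant and away from non-relevant results."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Toy example with 3-dimensional feature vectors.
q0 = np.array([0.2, 0.5, 0.3])
rel = np.array([[0.1, 0.7, 0.2]])
nonrel = np.array([[0.9, 0.0, 0.1]])
print(rocchio(q0, rel, nonrel))
```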
Ways to bridge the semantic gapy g g p
1 Machine learning1. Machine learning
2. Ontology based
3. Relevance feedback
4 Web image retrieval4. Web image retrieval
Content-based image and video retrieval 25
Web-based image retrieval

Popular web-search engines offer an image search function
The search is only based on textual evidence
Unable to confirm whether the retrieved images really contain the desired concepts

Basic idea:
Use a text-based classifier to detect concepts within the text
Use a visual-based classifier to detect concepts within the images
Fuse both classification results (a sketch follows below)
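A minimal sketch of the fusion step, assuming we already have one confidence score per image from each classifier (both scores are hypothetical numbers) and combine them linearly:

```python
def fused_score(text_score, visual_score, w_text=0.5):
    """Late fusion: linear combination of both classifiers' confidences."""
    return w_text * text_score + (1 - w_text) * visual_score

# Toy example: (text, visual) scores for the concept "beach" on three images.
candidates = {"img1": (0.9, 0.2), "img2": (0.8, 0.7), "img3": (0.1, 0.9)}
ranked = sorted(candidates, key=lambda k: fused_score(*candidates[k]), reverse=True)
print(ranked)  # img2 first, since text and visual evidence agree there
```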
Conclusion

ML alone only allows a limited query vocabulary
It is possible to lower the limitations significantly by using an ontology
None of the named techniques can fully bridge the semantic gap
Modern CBIR systems usually consist of a combination of some of the introduced techniques
Example Systems

Automatic image annotation
Goal: given an un-annotated image, automatically assign meaningful keywords
Approaches:
Co-occurrence model (Mori et al. 1999)
Translation model (Duygulu et al. 2002)
Cross-media relevance model (Jeon et al. 2003)

Link lexical ontology to high-level detectors
MediaMill – Semantic Video Search Engine (Snoek et al. 2007)
Automatic image annotation

Basic procedure:
1. Training data consists of many images with keywords (e.g. Corel stock photo CDs)
2. Divide each image into regions and extract features from them (uniform grid, Blobworld, Normalized cuts)
3. Cluster the image regions; the cluster centers are called “blobs”
4. Approximate a PDF to model the relationship between images (blobs) and words (co-occurrence model, translation model, cross-media relevance model)

A sketch of steps 2 and 3 is shown below.
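As a rough illustration of steps 2 and 3, a minimal sketch assuming scikit-learn; the uniform-grid feature extraction and the random stand-in images are hypothetical placeholders for real segmented photos.

```python
import numpy as np
from sklearn.cluster import KMeans

def region_features(image, grid=4):
    """Step 2 stand-in: uniform grid, mean color per cell as the region feature."""
    h, w, _ = image.shape
    cells = [image[i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid]
             for i in range(grid) for j in range(grid)]
    return np.array([c.reshape(-1, 3).mean(axis=0) for c in cells])

# Hypothetical training images: random arrays standing in for real photos.
rng = np.random.default_rng(0)
all_regions = np.vstack([region_features(rng.integers(0, 256, (64, 64, 3)))
                         for _ in range(10)])

# Step 3: cluster all region features; the cluster centers are the blobs.
kmeans = KMeans(n_clusters=8, n_init=10).fit(all_regions)

# Later, a new region is described by the index of its nearest blob:
print(kmeans.predict(region_features(rng.integers(0, 256, (64, 64, 3)))))
```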
Automatic image annotation

For each unknown image:
1. Divide the image into regions
2. Describe each region with the index of the nearest blob
3. Use the learned PDF to translate blobs to words
Co-occurrence model

Idea:
One image has the keywords “sky” and “beach”
Another image has the keywords “sky” and “mountain”
Each region inherits all words from its image
Regions associated to the same blob share all their words, accumulating the information from both images
Now the sky regions have two words “sky”, one word “beach” and one word “mountain”

“Image-to-word transformation based on dividing and vector quantizing images with words” (Mori et al. 1999)
Co-occurrence model
1. Use many images with key words for learningy g y g2. Divide each image into parts, extract features3. Each part inherits all words from its imagep g4. Find clusters from all divided images5. Accumulate the frequencies of words of all parts in5. Accumulate the frequencies of words of all parts in
each cluster, and calculate the likelihood for every word
6. For an unknown image, divide it into parts, extract features, find nearest clusters for all parts. Combine th lik lih d f th i l t d d t i i hthe likelihoods of their clusters and determine wichwords are most plausible
Content-based image and video retrieval 35
Co-occurrence model

Building the model: approximate $P(w_j \mid b_i)$ for each word $w_j$ and each blob $b_i$ via the Bayes formula:

$$P(w_j \mid b_i) = \frac{P(b_i \mid w_j)\, P(w_j)}{P(b_i)}$$

with the estimates (a reconstruction consistent with the slide’s fraction labels)

$$P(b_i \mid w_j) \approx \frac{n(w_j, b_i)}{n(w_j)}, \qquad P(w_j) \approx \frac{n(w_j)}{N}$$

where $n(w_j, b_i)$ is the total count of word $w_j$ in blob $b_i$, $n(w_j)$ the total count of word $w_j$ in all data, and $N$ the total count of all words in all data.
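A minimal sketch of this estimation on the toy data from the idea slide (sky/beach and sky/mountain), assuming the blob indices per image are already known:

```python
from collections import Counter

# Per image: the blob indices of its regions and the keywords of the whole
# image (each region inherits all words from its image).
images = [
    {"blobs": [0, 1], "words": ["sky", "beach"]},
    {"blobs": [0, 2], "words": ["sky", "mountain"]},
]

word_in_blob = Counter()   # n(w_j, b_i): count of word w_j in blob b_i
words_in_blob = Counter()  # total word count per blob

for img in images:
    for b in img["blobs"]:
        for w in img["words"]:
            word_in_blob[(w, b)] += 1
            words_in_blob[b] += 1

def p_word_given_blob(w, b):
    """Estimated likelihood P(w_j | b_i) from the co-occurrence counts."""
    return word_in_blob[(w, b)] / words_in_blob[b]

print(p_word_given_blob("sky", 0))  # 0.5: blob 0 saw "sky" twice out of 4 words
```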
Co-occurrence model

Annotating an unknown image:
Average the likelihoods of all blobs in the image
Words with the largest average likelihood value are estimated to be the image labels
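Written out (a reconstruction from the slide text, since the original formula image is not reproduced): for an unknown image $I$ with blobs $b_1, \ldots, b_m$,

$$P(w_j \mid I) \approx \frac{1}{m} \sum_{i=1}^{m} P(w_j \mid b_i)$$

and the words $w_j$ with the largest value are chosen as labels.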
Co-occurrence model: some results

(Figure: top three results; bold words match the manual annotations.)
Translation model

View image annotation as a task of translating from a vocabulary of blobs (L1) to a vocabulary of words (L2)
Aligned “texts” in both languages (blobs, words) are available (“aligned bitexts”)
Here: the words associated with whole images are known
Learning a lexicon for L1 -> L2 from aligned bitexts is a standard problem in machine translation
Goal: determine precise correspondences between the words in both languages
Here: which word goes with which image region?

“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary” (Duygulu et al. 2002)
Correspondences between blobs and words

Associated keywords and segments are available
Precise correspondences between words and segments are missing
Goal: find the correspondences between segments and words
Translation model

Building the model:
Use EM to estimate the parameters of the likelihood t(w|b)
E step: use an estimate of t(w|b) to predict the word-to-blob correspondences
M step: use the correspondences to refine the estimate of t(w|b)
Initialization: random values
Translation model

Building the model:
Use EM to estimate the parameters of the likelihood; we need to find
P(a_nj = i): probability that in an image n, a blob b_i is associated with word w_j
t(w|b): probability of obtaining word w given blob b
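A compact sketch of this EM loop in the style of IBM Model 1 (a simplification of the model used in the paper, with hypothetical toy data; initialization is uniform rather than random for brevity):

```python
from collections import defaultdict

# Hypothetical aligned "bitexts": per image, its blobs and its keywords.
images = [(["b0", "b1"], ["sky", "beach"]),
          (["b0", "b2"], ["sky", "mountain"])]

blobs = {b for bs, _ in images for b in bs}
words = {w for _, ws in images for w in ws}

# Initial translation table t(w | b).
t = {(w, b): 1.0 / len(words) for w in words for b in blobs}

for _ in range(20):
    count = defaultdict(float)  # expected counts c(w, b)
    total = defaultdict(float)  # expected counts per blob
    # E step: distribute each word over the blobs of its image
    # according to the current estimate of t(w | b).
    for bs, ws in images:
        for w in ws:
            norm = sum(t[(w, b)] for b in bs)
            for b in bs:
                p = t[(w, b)] / norm
                count[(w, b)] += p
                total[b] += p
    # M step: re-estimate t(w | b) from the expected counts.
    for (w, b) in t:
        t[(w, b)] = count[(w, b)] / total[b] if total[b] else 0.0

print(max(words, key=lambda w: t[(w, "b0")]))  # most plausible word for blob b0
```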
Translation model

Post-processing:
Cluster visually indistinguishable words
e.g. cat – tiger, eagle – jet
Assign a null word to all blobs whose probability to be associated with a word is too small
Threshold value learned on a validation set
Fit a new lexicon with a reduced vocabulary after thresholding, since some words may never be predicted with sufficient probability
Translation model

(Figure: some good results, with callouts for null words and clustering.)
Cross-Media Relevance model

Relevance model = underlying PDF P(·|I) for each image
Can be thought of as an urn containing all possible blobs and keywords that could appear in I
Annotate an image by sampling words from P(·|I)
The probability of observing any word w_i when sampling from P(·|I) needs to be known (i.e. P(w_i|I))

“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models” (Jeon et al. 2003)
Cross-Media Relevance model

Approximate P(w_i|I) by the probability of observing w_i given a previously observed set of blobs b_1, …, b_m
Use of the maximum-likelihood estimator is not possible, since no annotation of the blobs is available in the images
Instead, use the training set of images to estimate the joint probability of w_i and the blob set, and then marginalize the PDF with respect to w_i
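In the paper this joint probability is estimated as an expectation over the images $J$ in the training set $T$ (a reconstruction of the formula that is not reproduced on the slide):

$$P(w, b_1, \ldots, b_m) = \sum_{J \in T} P(J)\, P(w \mid J) \prod_{i=1}^{m} P(b_i \mid J)$$

from which the annotation probability follows as $P(w \mid b_1, \ldots, b_m) = P(w, b_1, \ldots, b_m) / P(b_1, \ldots, b_m)$.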
Cross-Media Relevance model

Difference to the co-occurrence and translation models:
Instead of learning one-to-one correspondences, CMRM learns a joint PDF to model the correspondence between a set of words and a set of regions
Takes context on the blob level into account
e.g. tiger+grass is more likely than tiger+beach
Cross-Media Relevance model

Three different models possible:
1. Probabilistic CMRM (PACMRM): return all word probabilities; good for ranked retrieval
2. Fixed annotation based CMRM (FACMRM): report only the N best words; good for people to look at
3. Direct-retrieval CMRM (DRCMRM): use the model to translate words to blobs and look for similar regions in each test image
Cross-Media Relevance model

(Figure: performance of the two ranked retrieval models.)
Cross-Media Relevance model

(Figure: automatic annotations (best four words) compared to manual annotations.)
Automatic annotation - Results

(Figures: per-word precision and recall for the three models.)

Mean precision and recall over all words:

Model                  Mean Precision   Mean Recall
Co-occurrence model    0.07             0.11
Translation model      0.14             0.24
CMRM                   0.33             0.37
Automatic annotation – Conclusion

Easier to train than dedicated concept detectors (no tuning needed)
Allows a fine degree of semantic granularity

Results may contain noise or may not represent the user’s intention:
Use relevance feedback
Only words from the training vocabulary are allowed as a query:
Use an ontology
Adding Semantics to Detectors for Video Retrieval

Snoek et al., IEEE Transactions on Multimedia, 2007
MediaMill

The system consists of 101 machine-learned high-level feature (= concept) detectors
Key-frame based video classification

Goal: allow the user to use natural language queries instead of restricting him to the limited vocabulary of learned concepts

Idea: link WordNet to the concept-detector set
Semantically enriched detectors

Each detector is associated with:

A manually created textual description
Storms – “outdoor scenes with stormy weather, thunderstorms, lightning”

Links to WordNet
Manually map the description to WordNet synset descriptions (synset = group of semantically equivalent elements)

A visual model
Estimates a confidence to indicate whether the concept is present in a shot
MediaMill

Detector selection strategies:

Text matching:
Select the detector with the highest similarity between the query and the detector description

Ontology querying:
Translate the query to ontological concepts
Query WordNet to determine which detector is most related to the original query (see the sketch below)
A concept like “vehicle” is also defined by occurrences of its sub-concepts like “car”, “truck”, etc.

Semantic visual querying:
Use all concept detectors to classify the concept of the query images and select the most likely detector
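As a rough sketch of the ontology-querying strategy, one could score each detector by the WordNet similarity between the query synset and the detector’s linked synset; this assumes NLTK, uses a simple path similarity rather than MediaMill’s actual measure, and the detector set is hypothetical.

```python
from nltk.corpus import wordnet as wn

# Hypothetical detector set, each linked to a WordNet synset.
detectors = {
    "car": wn.synset("car.n.01"),
    "boat": wn.synset("boat.n.01"),
    "aircraft": wn.synset("aircraft.n.01"),
}

def select_detector(query):
    """Pick the detector whose synset is most related to the query synset."""
    q = wn.synsets(query)[0]
    return max(detectors, key=lambda d: q.path_similarity(detectors[d]) or 0)

print(select_detector("vehicle"))  # selects one of vehicle's sub-concepts
```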
Results

Average precision per selection strategy:

Text matching:       50.8 %
Ontology querying:   56.0 %
Visual querying:     55.6 %

Average precision after fusing selection strategies (linear combination of the results):

Text + Ontology:     65.5 %
Text + Visual:       72.4 %
Ontology + Visual:   75.9 %
All:                 83.4 %
Lessons learned

Bridging the semantic gap is a major problem of CBIR
A plethora of approaches exists to deal with this problem
Automatic annotation algorithms allow a fine degree of semantic granularity
Linking detectors to lexical databases allows natural language querying
References

“Image-to-word transformation based on dividing and vector quantizing images with words”, Mori et al., MISRM 1999
“Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary”, Duygulu et al., ECCV 2002
“Automatic Image Annotation and Retrieval using Cross-Media Relevance Models”, Jeon et al., ACM SIGIR 2003
“Adding Semantics to Detectors for Video Retrieval”, Snoek et al., IEEE Transactions on Multimedia, 2007