Lecture 10: Motion Features and Introduction to Content
Based Image and Video Retrieval
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2006
COMP9519 Multimedia Systems – Lecture 10 – Slide 2 – J Chen
Last lecture…
Color features: color and color spaces; histograms and similarity metrics; color descriptors (Dominant, Scalable)
Texture features, edge features, shape features
Last lecture… (Color Feature)
Color spaces: RGB, HSV, HMMD, YCbCr
Color histograms: represented by a set of (bin, frequency) pairs; binning: fixed, cluster, adaptive
Last lecture… (Similarity Metrics)

                            Lp      Χ²      KL      JD      QF      EMD
Symmetrical                 yes     yes     no      yes     yes     yes
Computational complexity    medium  medium  medium  medium  high    high
Ground distance             no      no      no      no      yes     yes
Adaptive binning support    no      no      no      no      yes     yes
Partial matches             no      no      no      no      no      yes

Accuracy in image retrieval depends on the application; Χ² usually gives reasonably good results.
Last lecture… Color Descriptors in MPEG-7
Dominant Color, Scalable Color (HSV), Color Structure (HMMD), Color Layout (YCbCr)
Dominant Color Descriptor (DCD)
Extraction of dominant colors by minimizing the distortion:

  F = { (c_i, p_i, v_i), s },  i = 1, 2, ..., N

  D = sum_i sum_k h(k) ||x(k) - c_i||^2,  x(k) in C_i,  i = 1, ..., N

Updating rule (cluster centroid):

  c_i = ( sum_k h(k) x(k) ) / ( sum_k h(k) ),  x(k) in C_i

Similarity measurement of DCD
Last lecture… (Texture Feature)
Approaches to texture features: angular features (directionality); radial features (coarseness)
Texture feature descriptor: partition of the frequency domain into 30 channels; energy and energy deviation of each channel; mean and standard deviation of the frequency coefficients
Edge histogram: local histograms (16 x 5 = 80 bins); global histogram (accumulation of local histograms); semi-global histograms
Last lecture… (Shape Feature)
Region-based descriptorContour-based descriptor
Outline
Motion features: camera motion, motion activity, motion trajectory
Introduction to content based image and video retrieval
Motion estimation
Pixel-based motion estimation (optical flow): computes a velocity vector for each pixel in the frame; highly accurate motion estimation.
Problems: fails under variable lighting conditions or occlusion; vulnerable to noise; computationally complex.
Block matching: simple and effective; used in MPEG-1/2/4, H.261/2/3/4, etc.
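As a sketch of the block-matching idea (a generic full search, not the exact method of any particular MPEG encoder), the matcher minimizes the sum of absolute differences (SAD) over all candidate displacements within a search range:

```c
#include <stdlib.h>
#include <limits.h>

/* SAD between a bsize x bsize block at (bx, by) in the current frame
 * and the block displaced by (dx, dy) in the reference frame.
 * Frames are w x h arrays of 8-bit luma samples. */
static int sad(const unsigned char *cur, const unsigned char *ref,
               int w, int bx, int by, int dx, int dy, int bsize)
{
    int s = 0;
    for (int y = 0; y < bsize; y++)
        for (int x = 0; x < bsize; x++)
            s += abs(cur[(by + y) * w + bx + x] -
                     ref[(by + y + dy) * w + bx + x + dx]);
    return s;
}

/* Exhaustive (full) search for the motion vector within +/- range. */
void full_search(const unsigned char *cur, const unsigned char *ref,
                 int w, int h, int bx, int by, int bsize, int range,
                 int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    *best_dx = *best_dy = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            /* skip candidates that fall outside the reference frame */
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + bsize > w || by + dy + bsize > h)
                continue;
            int s = sad(cur, ref, w, bx, by, dx, dy, bsize);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
}
```

Real encoders use fast search patterns (three-step, diamond, etc.) rather than this exhaustive scan, but the cost function is the same.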
MPEG-7 motion descriptors
Parametric Motion uses the same motion model and syntax as the Warping Parameters
Camera motion
Captures 3-D camera motion parameters:
  tracking (horizontal transverse movement, also called travelling in the film industry)
  booming (vertical transverse movement)
  dollying (translation along the optical axis)
  panning (horizontal rotation)
  tilting (vertical rotation)
  zooming (change of the focal length)
  rolling (rotation around the optical axis)
[Figure (MPEG-7): pan left/right, tilt up/down, roll; track left/right, boom up/down, dolly forward/backward]
Motion activity descriptor
Captures the "intensity of action" or "pace of action" in a video segment.
Examples of high activity include scenes such as "goal scoring in a soccer match", "scoring in a baseball game" or "a high-speed car chase"; on the other hand, scenes such as "news reader shot", "an interview scene" or "a still shot" are perceived as low-activity shots.
Attributes: intensity of activity; direction of activity; spatial distribution of activity; temporal distribution of activity
Applications: content repurposing, surveillance, fast browsing, video abstracting, video editing, content-based querying
Intensity of motion
A high value of intensity indicates high activity while a low value of intensity indicates low activity.
For example, a still shot has a low intensity of activity while a “fast break” basketball shot has a high intensity of activity.
Example: Motion_shot_00 (low motion)
Example: Motion_shot_17 (high motion)
Extraction of intensity of motion activity
Five intensity levels: 1) very low; 2) low; 3) medium; 4) high; 5) very high.
The intensity of motion activity is computed as the quantized standard deviation of the motion-vector magnitudes in the video segment.
* Jeannin, S., and A. Divakaran. MPEG-7 Visual Motion Descriptors, CSVT, Vol 11, No. 6, pp. 720-724, June 2001.
Direction of Activity (optional)
While a video shot may have several objects with differing activity, we can often identify a dominant direction.
The direction parameter expresses the dominant direction of the activity if any.
It is expressed as a 3-bit integer taking one of eight values, corresponding to eight equally spaced directions.
Example: Motion_shot_013 (direction of motion)
Extraction of direction of activity
Angle θ of the dominant motion vector MV in the x-y plane, quantized to 3 bits:

int quantize_angle(float f_angle)
{
    int direction;
    /* quantize the angle using uniform 3-bit quantization over
       0-360 degrees, i.e. 0, 45, 90, 135, 180, 225, 270, 315 */
    if      ((f_angle >= -22.5) && (f_angle <  22.5)) direction = 0; /* 000 */
    else if ((f_angle >=  22.5) && (f_angle <  67.5)) direction = 1; /* 001 */
    else if ((f_angle >=  67.5) && (f_angle < 112.5)) direction = 2; /* 010 */
    else if ((f_angle >= 112.5) && (f_angle < 157.5)) direction = 3; /* 011 */
    else if ((f_angle >= 157.5) && (f_angle < 202.5)) direction = 4; /* 100 */
    else if ((f_angle >= 202.5) && (f_angle < 247.5)) direction = 5; /* 101 */
    else if ((f_angle >= 247.5) && (f_angle < 292.5)) direction = 6; /* 110 */
    else if ((f_angle >= 292.5) && (f_angle < 337.5)) direction = 7; /* 111 */
    else direction = 0; /* angles in [337.5, 360) wrap around to 0 */
    return direction;
}
Spatial distribution of activity
Indicates whether the activity is spread across many regions or restricted to one large region; an indication of the number and size of "active" regions in a frame.
For example, a talking-head sequence has one large active region, while an aerial shot of a busy street has many small active regions.
The spatial distribution parameter is expressed by three integers using a total of 16 bits.
Example: Motion_shot_26 (spatial distribution of activity)
Temporal distribution of activity
Expresses the variation of activity over the duration of the video.
Represented by five 6-bit integers forming a 5-bin histogram: bins N0, N1, N2, N3 and N4 correspond to intensity values 1, 2, 3, 4 and 5 respectively.
The histogram expresses the relative frequency of the different activity levels in the sequence, as defined by the intensity element above.
Each value is the percentage of occurrences of the corresponding quantized intensity level, uniformly quantized to 6 bits.
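A sketch of building this descriptor from per-subsegment intensity levels; round-to-nearest quantization of the percentages is an assumption, not mandated by the standard:

```c
/* Temporal distribution of activity: build the 5-bin histogram
 * N0..N4 from intensity levels (1..5), express each bin as a
 * percentage of occurrences, and quantize that percentage
 * uniformly to 6 bits (0..63). */
void temporal_activity_histogram(const int *levels, int n, int hist6bit[5])
{
    int count[5] = {0, 0, 0, 0, 0};
    for (int i = 0; i < n; i++)
        if (levels[i] >= 1 && levels[i] <= 5)
            count[levels[i] - 1]++;
    for (int b = 0; b < 5; b++) {
        float pct = 100.0f * count[b] / n;                /* 0..100  */
        hist6bit[b] = (int)(pct * 63.0f / 100.0f + 0.5f); /* 0..63   */
    }
}
```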
Example: Motion_shot_032 (temporal distribution of activity)
Motion trajectory
Describes the displacements of objects over time.
The trajectory model is a first- or second-order piecewise approximation along time, for each spatial dimension.
Key-points represent the successive spatio-temporal positions of the described object: a set of (x, y, t) triples for a 2-D trajectory, or (x, y, z, t) for a 3-D trajectory.
By default, linear (first-order) interpolation between key-points is used; interpolating parameters can be added to specify non-linear interpolation between key-points, using a second-order function of time.
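The default first-order interpolation between key-points can be sketched as follows, for one spatial dimension (the same function applies to y or z):

```c
/* Linear (first-order) interpolation of an object's position between
 * trajectory key-points, the default in the MPEG-7 motion trajectory
 * descriptor.  Key-points are (t[i], x[i]) pairs with strictly
 * increasing times. */
float interp_linear(const float *t, const float *x, int n, float tq)
{
    for (int i = 0; i < n - 1; i++) {
        if (tq >= t[i] && tq <= t[i + 1]) {
            float a = (tq - t[i]) / (t[i + 1] - t[i]);
            return (1.0f - a) * x[i] + a * x[i + 1];
        }
    }
    return x[n - 1]; /* tq beyond the last key-point: hold last value */
}
```

With the example key-points on the next slide, (50, 120, 5/30), (52, 120, 15/30), (54, 120, 25/30), querying the x coordinate at t = 10/30 yields the midpoint 51.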
Example – motion trajectory
Key-points (x, y, t): (50, 120, 5/30), (52, 120, 15/30), (54, 120, 25/30)
Example: Linear interpolation (example code in Matlab)
Example: Second-order (polynomial) interpolation
First- and second-order interpolation
First-order (linear) interpolation; second-order (polynomial) interpolation
Example of trajectory representation (one dimension)
* Jeannin, S., and A. Divakaran. MPEG-7 Visual Motion Descriptors, CSVT, Vol 11, No. 6, pp. 720-724, June 2001.
Extraction of the motion trajectory descriptor
Assumes the positions of objects are known; they may be generated through segmentation/tracking (difficult though).
The selection of key-points and their functions is not defined by the MPEG-7 standard:
  Option 1: select key-points by sampling at regular time intervals (simplest way)
  Option 2 (bottom up): start from many key-points and recursively remove points until the interpolation error exceeds a given threshold
  Option 3 (top down): start with one interval containing two points and recursively split intervals in two at the position where the interpolation error is maximum
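Option 3 can be sketched for a one-dimensional trajectory as follows (a hypothetical illustration; MPEG-7 does not standardize key-point selection):

```c
#include <math.h>

/* Top-down key-point selection: given sampled positions x[0..n-1] at
 * unit time steps, recursively split each interval at the sample with
 * the largest linear-interpolation error until every interval is
 * within tolerance.  keep[i] is set to 1 for selected key-points. */
static void split(const float *x, int lo, int hi, float tol, int *keep)
{
    int worst = -1;
    float werr = tol;
    for (int i = lo + 1; i < hi; i++) {
        float a = (float)(i - lo) / (hi - lo);
        float err = fabsf(x[i] - ((1.0f - a) * x[lo] + a * x[hi]));
        if (err > werr) { werr = err; worst = i; }
    }
    if (worst >= 0) {
        keep[worst] = 1;
        split(x, lo, worst, tol, keep);
        split(x, worst, hi, tol, keep);
    }
}

/* Returns the number of selected key-points (endpoints always kept). */
int select_keypoints(const float *x, int n, float tol, int *keep)
{
    int count = 0;
    for (int i = 0; i < n; i++) keep[i] = 0;
    keep[0] = keep[n - 1] = 1;
    split(x, 0, n - 1, tol, keep);
    for (int i = 0; i < n; i++) count += keep[i];
    return count;
}
```

A straight-line trajectory keeps only its two endpoints; a trajectory with a sharp peak accumulates key-points around the peak, mirroring the figures that follow.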
[Figures: Option 1, regular time-interval sampling; Option 2, bottom up (parts 1-4); Option 3, top down (parts 1-3). Black: true trajectory; red: linear interpolation.]
Outline
Motion features: camera motion, motion activity, motion trajectory
Introduction to content based image and video retrieval: text-based retrieval; content-based retrieval; query formation; feature extraction; similarity comparison; performance evaluation
Google video search
Google video search result part
A closer look
Text-based approach for image and video retrieval
Keyword annotation + text-based searching techniques from traditional database management systems.
Annotation methods:
  By human: labor-intensive, subjective, content-sensitive and usually incomplete
  Extracting annotations from speech transcripts (Google video search): low accuracy; may be improved with better machine understanding of natural language (difficult!)
  Automated machine understanding of images and videos: the "semantic gap" between keywords and low-level visual features; a challenging research topic
More problems of text-based retrieval
Fails to capture image content: certain visual properties (patterns, colors, shapes, textures) are difficult or nearly impossible to describe with text.
Limited scope: a pre-determined dictionary.
Bridging the semantic gap
Pattern recognition: develop a recognizer/classifier for each query concept (e.g. a face detector).
A simple and typical approach is feature extraction from images/video + a classifier (e.g. Support Vector Machines).
Hard to generalize: impractical to develop classifiers for every possible query concept.
Ontology (e.g. broadcasting news): objects, actions, sites, concepts (e.g. outdoors/indoors, person/people/face, news subject, anchor, crowd, news monolog, news dialog, studio)
MPEG-7 Video Annotation Tool
IBM Video Annex demo: http://www.alphaworks.ibm.com/tech/videoannex
Content based image and video retrieval
Emerged in the early 1990s.
Represents and indexes images/video with features (color, texture, shape, etc.) extracted from the image/video content.
Typical systems: QBIC, VisualSEEk, SIMPLIcity, etc. (one in this lecture; others in lecture 13)
Image retrieval system diagram
[Diagram: user -> query formation -> feature extraction -> feature vectors -> similarity comparison -> indexing & retrieval -> retrieval results -> output, with relevance feedback from the user; images in the image database pass through feature extraction to produce stored feature vectors]
Query specification
The process of connecting user input with feature extraction to obtain feature vectors searchable in the database.
Four major categories:
  Category browsing: images are classified into different categories based on their semantic or visual content
  Query by concept: user-supplied keyword -> concept (annotation); i.e. text based
  Query by sketch: user-drawn sketch -> feature vectors
  Query by example: user-supplied example image -> feature vectors
Category browsing (1)
* A. Vailaya, A. K. Jain, and H. J. Zhang, “On image classification: City images vs. landscapes,” Pattern Recognit., vol. 31, no. 12, pp. 1921–1936, 1998.
Category browsing (2)
* A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, "Image Classification for Content-Based Indexing," IEEE Trans. Image Processing, vol. 10, no. 1, pp. 117--130, 2001.
Image categorical pre-filtering may improve retrieval accuracy
(a) Query image
(b) top-ten retrieved images from 2145 city and landscape images
(c) top-ten retrieved images from 760 city images; filtering out landscape images prior to querying clearly improves the retrieval results.
* A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, "Image Classification for Content-Based Indexing," IEEE Trans. Image Processing, vol. 10, no. 1, pp. 117--130, 2001.
Limitations of categorical browsing
Ambiguity in categorizing images/videos
Images/videos found depending on the browsing path
Difficult to use if the number of categories is large
The ability to search is preferred in many applications
Query by concept
* A. Natsev, A. Chadha, B. Soetarman, and J. S. Vitter, "CAMEL: Concept Annotated iMage Libraries," Proc. of SPIE Electronic Imaging 2001: Storage and Retrieval for Image and Video Databases, San Jose, CA, Jan 2001.
Query by sketch
VisualSEEK user interface
The user sketches regions, positions them on the query grid, and assigns them properties of color, size and absolute location; the user may also assign boundaries for location and size.
* John R. Smith , Shih-Fu Chang, VisualSEEk: a fully automated content-based image query system, Proc ACM Int Conf on Multimedia, p.87-98, Nov 18-22, 1996, USA
VisualSEEK examples
* John R. Smith , Shih-Fu Chang, VisualSEEk: a fully automated content-based image query system, Proc ACM Int Conf on Multimedia, p.87-98, Nov 18-22, 1996, USA
Query by example
Using the shape feature in the above example.
* W. Y. Ma and B. S. Manjunath, "NeTra: a toolbox for navigating large image databases," Multimedia Systems, vol. 7, no. 3, pp. 184-198, Springer-Verlag, Berlin, Germany, May 1999.
Image retrieval system diagram (as above)
Visual features - recap
Why visual features? Manual labeling is very time consuming; content is difficult to describe completely with text; machine understanding of images/video is far from mature.
What visual features? Those extractable from images/video; learn from the human visual system.
Visual feature => feature vectors
Popular visual features
Color: color histogram (HSV, YCbCr, ...); color moments; dominant color
Texture: structural and statistical; texture histogram; edge histogram
Shape: boundaries of objects
Motion: camera motion (pan/tilt/zoom); object motion
Content based retrieval system diagram (as above)
Similarity comparison
Given two feature vectors I and J, the distance is defined as D(I, J) = f(I, J).
Typical similarity metrics: Lp (Minkowski distance); Χ² metric; KL (Kullback-Leibler divergence); JD (Jeffrey divergence); QF (quadratic form); EMD (Earth mover's distance)
K-nearest neighbour search
Given a query vector v_q, a brute-force k-nearest-neighbour search is (essentially):

  results = []; maxD = infinity
  for each obj in the database {
      dist = D(v_obj, v_q)
      if (#results < k or dist < maxD) {
          insert (obj, dist) into results   // results is sorted, length <= k
          maxD = largest dist in results
      }
  }

Cost = T_open + N_P * T_P + N * T_D
Note: if q is an image from the database, we can use a pre-computed distance table to make this much faster.
* John Shepherd
Name     Meaning                                                    Typically
N        number of objects in the database                          10^3 .. 10^10
N_P      number of disk pages to hold stored objects                50 .. 10^10
T_P      time to read a page from disk into memory                  10 ms
T_D      time to compute distance between two objects (vectors)     100 us (?)
T_open   time to open a database file                               10 ms
Content based retrieval system diagram (as above)
Performance evaluation
We have three numbers: #system-correctly-retrieved-images, #system-retrieved-images, #relevant-images-in-DB.
Precision = #system-correctly-retrieved-images / #system-retrieved-images
Recall = #system-correctly-retrieved-images / #relevant-images-in-DB
F-number = (2 x precision x recall) / (precision + recall)
[Figure: precision vs. recall curve]
A tutorial question
Suppose we have 1000 images in the database and we want to retrieve images with the concept "car". There are 200 "car" images in the database. We retrieved 250 images, of which 150 are "car" images. Calculate precision, recall and F-number.
A demo retrieval system
MARVeL (from IBM Research)
Exe file; result: file:///c:/marvel/docs/html/main/0/index.html
IBM in TRECVID 2004
Visual features included color histograms, edge histograms, color moments, wavelet texture, co-occurrence texture, moment invariants etc.
Assignment 2
See http://www.cse.unsw.edu.au/~cs9519/assig-2/
Submission deadline: 4 Nov 2005
Start early to avoid the late rush and possible conflicts with exams!
Some references
S. Jeannin and A. Divakaran, "MPEG-7 visual motion descriptors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 720-724, Jun 2001.
B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, New York, NY, 2002 (book).
F. Long, H.-J. Zhang, and D. Feng, "Fundamentals of content-based image retrieval," Chapter 1 in Multimedia Information Retrieval and Management, http://research.microsoft.com/asia/dload_files/group/mcomputing/2003P/ch01_Long_v40-proof.pdf