Lecture 10: Motion Features and Introduction to Content
Based Image and Video Retrieval
Dr Jing Chen, NICTA & CSE UNSW
COMP9519 Multimedia Systems, S2 2006
COMP9519 Multimedia Systems – Lecture 10 – Slide 2 – J Chen
Last lecture…
Color features: color and color spaces; histograms and similarity metrics; color descriptors (Dominant, Scalable)
Texture features, edge features, shape features
Last lecture… (Color Feature)
Color spaces: RGB, HSV, HMMD, YCbCr
Color histograms: represented by a set of (bin, frequency) pairs; binning: fixed, cluster, adaptive
Last lecture… (Similarity Metrics)

                            Lp      Χ²      KL      JD      QF      EMD
Symmetrical                 yes     yes     no      yes     yes     yes
Computational complexity    medium  medium  medium  medium  high    high
Ground distance             no      no      no      no      yes     yes
Adaptive binning support    no      no      no      no      yes     yes
Partial matches             no      no      no      no      no      yes

Accuracy in image retrieval depends on the application; Χ² usually gives reasonably good results.
Last lecture… Color Descriptors in MPEG-7
Dominant Color, Scalable Color (HSV), Color Structure (HMMD), Color Layout (YCbCr)
Dominant Color Descriptor (DCD)
Extraction of dominant colors by minimizing the distortion:

  F = { (c_i, p_i, v_i), s },  i = 1, 2, ..., N

  D = sum_i sum_k h(k) ||x(k) - c_i||^2,  x(k) in C_i,  i = 1, ..., N

Updating rule (cluster centroid):

  c_i = ( sum_k h(k) x(k) ) / ( sum_k h(k) ),  x(k) in C_i

Similarity measurement of DCD
Last lecture… (Texture Feature)
Approaches to texture features: angular features (directionality); radial features (coarseness)
Texture feature descriptor: partition of the frequency domain into 30 channels; energy and energy deviation of each channel; mean and standard deviation of the frequency coefficients
Edge histogram: local histograms (16 x 5 = 80 bins); global histogram (accumulation of local histograms); semi-global histograms
Last lecture… (Shape Feature)
Region-based descriptorContour-based descriptor
Outline
Motion features: camera motion, motion activity, motion trajectory
Introduction to content based image and video retrieval
Motion estimation
Pixel-based motion estimation (optical flow): computes a velocity vector for each pixel in the frame; highly accurate motion estimation.
Problems: fails under variable lighting conditions or occlusion; vulnerable to noise; computationally complex.
Block matching: simple and effective; used in MPEG-1/2/4, H.261/2/3/4, etc.
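As a sketch of the block-matching idea (a generic full search, not the exact method of any particular MPEG encoder), the matcher minimizes the sum of absolute differences (SAD) over all candidate displacements within a search range:

```c
#include <stdlib.h>
#include <limits.h>

/* SAD between a bsize x bsize block at (bx, by) in the current frame
 * and the block displaced by (dx, dy) in the reference frame.
 * Frames are w x h arrays of 8-bit luma samples. */
static int sad(const unsigned char *cur, const unsigned char *ref,
               int w, int bx, int by, int dx, int dy, int bsize)
{
    int s = 0;
    for (int y = 0; y < bsize; y++)
        for (int x = 0; x < bsize; x++)
            s += abs(cur[(by + y) * w + bx + x] -
                     ref[(by + y + dy) * w + bx + x + dx]);
    return s;
}

/* Exhaustive (full) search for the motion vector within +/- range. */
void full_search(const unsigned char *cur, const unsigned char *ref,
                 int w, int h, int bx, int by, int bsize, int range,
                 int *best_dx, int *best_dy)
{
    int best = INT_MAX;
    *best_dx = *best_dy = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            /* skip candidates that fall outside the reference frame */
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + bsize > w || by + dy + bsize > h)
                continue;
            int s = sad(cur, ref, w, bx, by, dx, dy, bsize);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
}
```

Real encoders use fast search patterns (three-step, diamond, etc.) rather than this exhaustive scan, but the cost function is the same.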
MPEG-7 motion descriptors
Parametric Motion uses the same motion model and syntax as the Warping Parameters
Camera motion
Captures 3-D camera motion parameters:
  tracking (horizontal transverse movement, also called travelling in the film industry)
  booming (vertical transverse movement)
  dollying (translation along the optical axis)
  panning (horizontal rotation)
  tilting (vertical rotation)
  zooming (change of the focal length)
  rolling (rotation around the optical axis)
[Figure (MPEG-7): pan left/right, tilt up/down, roll; track left/right, boom up/down, dolly forward/backward]
Motion activity descriptor
Captures the "intensity of action" or "pace of action" in a video segment.
Examples of high activity include scenes such as "goal scoring in a soccer match", "scoring in a baseball game" or "a high-speed car chase"; on the other hand, scenes such as "news reader shot", "an interview scene" or "a still shot" are perceived as low-activity shots.
Attributes: intensity of activity; direction of activity; spatial distribution of activity; temporal distribution of activity
Applications: content repurposing, surveillance, fast browsing, video abstracting, video editing, content-based querying
Intensity of motion
A high value of intensity indicates high activity while a low value of intensity indicates low activity.
For example, a still shot has a low intensity of activity while a “fast break” basketball shot has a high intensity of activity.
Example: Motion_shot_00 (low motion)
Example: Motion_shot_17 (high motion)
Extraction of intensity of motion activity
Five intensity levels: 1) very low; 2) low; 3) medium; 4) high; 5) very high.
The intensity of motion activity is computed as the quantized standard deviation of the motion-vector magnitudes in the video segment.
* Jeannin, S., and A. Divakaran. MPEG-7 Visual Motion Descriptors, CSVT, Vol 11, No. 6, pp. 720-724, June 2001.
Direction of Activity (optional)
While a video shot may have several objects with differing activity, we can often identify a dominant direction.
The direction parameter expresses the dominant direction of the activity if any.
It is expressed as a 3-bit integer taking one of eight values, corresponding to eight equally spaced directions.
Example: Motion_shot_013 (direction of motion)
Extraction of direction of activity
Angle θ of the dominant motion vector MV in the x-y plane, quantized to 3 bits:

int quantize_angle(float f_angle)
{
    int direction;
    /* quantize the angle using uniform 3-bit quantization over
       0-360 degrees, i.e. 0, 45, 90, 135, 180, 225, 270, 315 */
    if      ((f_angle >= -22.5) && (f_angle <  22.5)) direction = 0; /* 000 */
    else if ((f_angle >=  22.5) && (f_angle <  67.5)) direction = 1; /* 001 */
    else if ((f_angle >=  67.5) && (f_angle < 112.5)) direction = 2; /* 010 */
    else if ((f_angle >= 112.5) && (f_angle < 157.5)) direction = 3; /* 011 */
    else if ((f_angle >= 157.5) && (f_angle < 202.5)) direction = 4; /* 100 */
    else if ((f_angle >= 202.5) && (f_angle < 247.5)) direction = 5; /* 101 */
    else if ((f_angle >= 247.5) && (f_angle < 292.5)) direction = 6; /* 110 */
    else if ((f_angle >= 292.5) && (f_angle < 337.5)) direction = 7; /* 111 */
    else direction = 0; /* angles in [337.5, 360) wrap around to 0 */
    return direction;
}
Spatial distribution of activity
Indicates whether the activity is spread across many regions or restricted to one large region; an indication of the number and size of "active" regions in a frame.
For example, a talking-head sequence has one large active region, while an aerial shot of a busy street has many small active regions.
The spatial distribution parameter is expressed by three integers using a total of 16 bits.
Example: Motion_shot_26 (spatial distribution of activity)
Temporal distribution of activity
Expresses the variation of activity over the duration of the video.
Represented by five 6-bit integers forming a 5-bin histogram: bins N0, N1, N2, N3 and N4 correspond to intensity values 1, 2, 3, 4 and 5 respectively.
The histogram expresses the relative frequency of the different activity levels in the sequence, as defined by the intensity element above.
Each value is the percentage of occurrences of the corresponding quantized intensity level, uniformly quantized to 6 bits.
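A sketch of building this descriptor from per-subsegment intensity levels; round-to-nearest quantization of the percentages is an assumption, not mandated by the standard:

```c
/* Temporal distribution of activity: build the 5-bin histogram
 * N0..N4 from intensity levels (1..5), express each bin as a
 * percentage of occurrences, and quantize that percentage
 * uniformly to 6 bits (0..63). */
void temporal_activity_histogram(const int *levels, int n, int hist6bit[5])
{
    int count[5] = {0, 0, 0, 0, 0};
    for (int i = 0; i < n; i++)
        if (levels[i] >= 1 && levels[i] <= 5)
            count[levels[i] - 1]++;
    for (int b = 0; b < 5; b++) {
        float pct = 100.0f * count[b] / n;                /* 0..100  */
        hist6bit[b] = (int)(pct * 63.0f / 100.0f + 0.5f); /* 0..63   */
    }
}
```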
Example: Motion_shot_032 (temporal distribution of activity)
Motion trajectory
Describes the displacements of objects over time.
The trajectory model is a first- or second-order piecewise approximation along time, for each spatial dimension.
Key-points represent the successive spatio-temporal positions of the described object: a set of (x, y, t) triples for a 2-D trajectory, or (x, y, z, t) for a 3-D trajectory.
By default, linear (first-order) interpolation between key-points is used; interpolating parameters can be added to specify non-linear interpolation between key-points, using a second-order function of time.
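The default first-order interpolation between key-points can be sketched as follows, for one spatial dimension (the same function applies to y or z):

```c
/* Linear (first-order) interpolation of an object's position between
 * trajectory key-points, the default in the MPEG-7 motion trajectory
 * descriptor.  Key-points are (t[i], x[i]) pairs with strictly
 * increasing times. */
float interp_linear(const float *t, const float *x, int n, float tq)
{
    for (int i = 0; i < n - 1; i++) {
        if (tq >= t[i] && tq <= t[i + 1]) {
            float a = (tq - t[i]) / (t[i + 1] - t[i]);
            return (1.0f - a) * x[i] + a * x[i + 1];
        }
    }
    return x[n - 1]; /* tq beyond the last key-point: hold last value */
}
```

With the example key-points on the next slide, (50, 120, 5/30), (52, 120, 15/30), (54, 120, 25/30), querying the x coordinate at t = 10/30 yields the midpoint 51.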
Example – motion trajectory
Key-points (x, y, t): (50, 120, 5/30), (52, 120, 15/30), (54, 120, 25/30)
Example: Linear interpolation (example code in Matlab)
Example: Second-order (polynomial) interpolation
First- and second-order interpolation
First-order (linear) interpolation; second-order (polynomial) interpolation
Example of trajectory representation (one dimension)
* Jeannin, S., and A. Divakaran. MPEG-7 Visual Motion Descriptors, CSVT, Vol 11, No. 6, pp. 720-724, June 2001.
Extraction of the motion trajectory descriptor
Assumes the positions of objects are known; they may be generated through segmentation/tracking (difficult though).
The selection of key-points and their functions is not defined by the MPEG-7 standard:
  Option 1: select key-points by sampling at regular time intervals (simplest way)
  Option 2 (bottom up): start from many key-points and recursively remove points until the interpolation error exceeds a given threshold
  Option 3 (top down): start with one interval containing two points and recursively split intervals in two at the position where the interpolation error is maximum
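Option 3 can be sketched for a one-dimensional trajectory as follows (a hypothetical illustration; MPEG-7 does not standardize key-point selection):

```c
#include <math.h>

/* Top-down key-point selection: given sampled positions x[0..n-1] at
 * unit time steps, recursively split each interval at the sample with
 * the largest linear-interpolation error until every interval is
 * within tolerance.  keep[i] is set to 1 for selected key-points. */
static void split(const float *x, int lo, int hi, float tol, int *keep)
{
    int worst = -1;
    float werr = tol;
    for (int i = lo + 1; i < hi; i++) {
        float a = (float)(i - lo) / (hi - lo);
        float err = fabsf(x[i] - ((1.0f - a) * x[lo] + a * x[hi]));
        if (err > werr) { werr = err; worst = i; }
    }
    if (worst >= 0) {
        keep[worst] = 1;
        split(x, lo, worst, tol, keep);
        split(x, worst, hi, tol, keep);
    }
}

/* Returns the number of selected key-points (endpoints always kept). */
int select_keypoints(const float *x, int n, float tol, int *keep)
{
    int count = 0;
    for (int i = 0; i < n; i++) keep[i] = 0;
    keep[0] = keep[n - 1] = 1;
    split(x, 0, n - 1, tol, keep);
    for (int i = 0; i < n; i++) count += keep[i];
    return count;
}
```

A straight-line trajectory keeps only its two endpoints; a trajectory with a sharp peak accumulates key-points around the peak, mirroring the figures that follow.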
[Figures: Option 1, regular time-interval sampling; Option 2, bottom up (parts 1-4); Option 3, top down (parts 1-3). Black: true trajectory; red: linear interpolation.]
Outline
Motion features: camera motion, motion activity, motion trajectory
Introduction to content based image and video retrieval: text-based retrieval; content-based retrieval; query formation; feature extraction; similarity comparison; performance evaluation
Google video search
Google video search result part
A closer look
Text-based approach for image and video retrieval
Keyword annotation + text-based searching techniques from traditional database management systems.
Annotation methods:
  By human: labor-intensive, subjective, content-sensitive and usually incomplete
  Extracting annotations from speech transcripts (Google video search): low accuracy; may be improved with better machine understanding of natural language (difficult!)
  Automated machine understanding of images and videos: the "semantic gap" between keywords and low-level visual features; a challenging research topic
More problems of text-based retrieval
Fails to capture image content: certain visual properties (patterns, colors, shapes, textures) are difficult or nearly impossible to describe with text.
Limited scope: a pre-determined dictionary.
Bridging the semantic gap
Pattern recognition: develop a recognizer/classifier for each query concept (e.g. a face detector).
A simple and typical approach is feature extraction from images/video + a classifier (e.g. Support Vector Machines).
Hard to generalize: impractical to develop classifiers for every possible query concept.
Ontology (e.g. broadcasting news): objects, actions, sites, concepts (e.g. outdoors/indoors, person/people/face, news subject, anchor, crowd, news monolog, news dialog, studio)
MPEG-7 Video Annotation Tool
IBM Video Annex demo: http://www.alphaworks.ibm.com/tech/videoannex
Content based image and video retrieval
Emerged in the early 1990s.
Represents and indexes images/video with features (color, texture, shape, etc.) extracted from the image/video content.
Typical systems: QBIC, VisualSEEk, SIMPLIcity, etc. (one in this lecture; others in lecture 13)
Image retrieval system diagram
[Diagram: user -> query formation -> feature extraction -> feature vectors -> similarity comparison -> indexing & retrieval -> retrieval results -> output, with relevance feedback from the user; images in the image database pass through feature extraction to produce stored feature vectors]
Query specification
The process of connecting user input with feature extraction to obtain feature vectors searchable in the database.
Four major categories:
  Category browsing: images are classified into different categories based on their semantic or visual content
  Query by concept: user-supplied keyword -> concept (annotation); i.e. text based
  Query by sketch: user-drawn sketch -> feature vectors
  Query by example: user-supplied example image -> feature vectors
Category browsing (1)
* A. Vailaya, A. K. Jain, and H. J. Zhang, “On image classification: City images vs. landscapes,” Pattern Recognit., vol. 31, no. 12, pp. 1921–1936, 1998.
Category browsing (2)
* A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, "Image Classification for Content-Based Indexing," IEEE Trans. Image Processing, vol. 10, no. 1, pp. 117--130, 2001.
Image categorical pre-filtering may improve retrieval accuracy
(a) Query image
(b) top-ten retrieved images from 2145 city and landscape images
(c) top-ten retrieved images from 760 city images; filtering out landscape images prior to querying clearly improves the retrieval results.
* A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang, "Image Classification for Content-Based Indexing," IEEE Trans. Image Processing, vol. 10, no. 1, pp. 117--130, 2001.
Limitations of categorical browsing
Ambiguity in categorizing images/videos
Images/videos found depending on the browsing path
Difficult to use if the number of categories is large
The ability to search is preferred in many applications
Query by concept
* A. Natsev, A. Chadha, B. Soetarman, and J. S. Vitter, "CAMEL: Concept Annotated iMage Libraries," Proc. of SPIE Electronic Imaging 2001: Storage and Retrieval for Image and Video Databases, San Jose, CA, Jan 2001.
Query by sketch
VisualSEEK user interface
The user sketches regions, positions them on the query grid, and assigns them properties of color, size and absolute location; the user may also assign boundaries for location and size.
* John R. Smith , Shih-Fu Chang, VisualSEEk: a fully automated content-based image query system, Proc ACM Int Conf on Multimedia, p.87-98, Nov 18-22, 1996, USA
VisualSEEK examples
* John R. Smith , Shih-Fu Chang, VisualSEEk: a fully automated content-based image query system, Proc ACM Int Conf on Multimedia, p.87-98, Nov 18-22, 1996, USA
Query by example
Using the shape feature in the above example.
* W. Y. Ma and B. S. Manjunath, "NeTra: a toolbox for navigating large image databases," Multimedia Systems, vol. 7, no. 3, pp. 184-198, Springer-Verlag, Berlin, Germany, May 1999.
Image retrieval system diagram (as above)
Visual features - recap
Why visual features? Manual labeling is very time consuming; content is difficult to describe completely with text; machine understanding of images/video is far from mature.
What visual features? Those extractable from images/video; learn from the human visual system.
Visual feature => feature vectors
Popular visual features
Color: color histogram (HSV, YCbCr, ...); color moments; dominant color
Texture: structural and statistical; texture histogram; edge histogram
Shape: boundaries of objects
Motion: camera motion (pan/tilt/zoom); object motion
Content based retrieval system diagram (as above)
Similarity comparison
Given two feature vectors I and J, the distance is defined as D(I, J) = f(I, J).
Typical similarity metrics: Lp (Minkowski distance); Χ² metric; KL (Kullback-Leibler divergence); JD (Jeffrey divergence); QF (quadratic form); EMD (Earth mover's distance)
K-nearest neighbour search
Given a query vector v_q, a brute-force k-nearest-neighbour search is (essentially):

  results = []; maxD = infinity
  for each obj in the database {
      dist = D(v_obj, v_q)
      if (#results < k or dist < maxD) {
          insert (obj, dist) into results   // results is sorted, length <= k
          maxD = largest dist in results
      }
  }

Cost = T_open + N_P * T_P + N * T_D
Note: if q is an image from the database, we can use a pre-computed distance table to make this much faster.
* John Shepherd
Name     Meaning                                                    Typically
N        number of objects in the database                          10^3 .. 10^10
N_P      number of disk pages to hold stored objects                50 .. 10^10
T_P      time to read a page from disk into memory                  10 ms
T_D      time to compute distance between two objects (vectors)     100 us (?)
T_open   time to open a database file                               10 ms
Content based retrieval system diagram (as above)
Performance evaluation
We have three numbers: #system-correctly-retrieved-images, #system-retrieved-images, #relevant-images-in-DB.
Precision = #system-correctly-retrieved-images / #system-retrieved-images
Recall = #system-correctly-retrieved-images / #relevant-images-in-DB
F-number = (2 x precision x recall) / (precision + recall)
[Figure: precision vs. recall curve]
A tutorial question
Suppose we have 1000 images in the database and we want to retrieve images with the concept "car". There are 200 "car" images in the database. We retrieved 250 images, of which 150 are "car" images. Calculate precision, recall and F-number.
A demo retrieval system
MARVeL (from IBM Research)
Exe file; result: file:///c:/marvel/docs/html/main/0/index.html
IBM in TRECVID 2004
Visual features included color histograms, edge histograms, color moments, wavelet texture, co-occurrence texture, moment invariants etc.
Assignment 2
See http://www.cse.unsw.edu.au/~cs9519/assig-2/
Submission deadline: 4 Nov 2005
Start early to avoid the late rush and possible conflicts with exams!
Some references
S. Jeannin and A. Divakaran, "MPEG-7 visual motion descriptors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 720-724, Jun 2001.
B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley & Sons, New York, NY, 2002 (book).
F. Long, H.-J. Zhang, and D. Feng, "Fundamentals of content-based image retrieval," Chapter 1 in Multimedia Information Retrieval and Management, http://research.microsoft.com/asia/dload_files/group/mcomputing/2003P/ch01_Long_v40-proof.pdf