Video Search Engines and Content-Based Retrieval

Steven C.H. Hoi

CUHK, CSE

18-Sept, 2006

Outline

Video Search Engines

Content-Based Video Retrieval

Video Search Engines

A survey of the state of the art

Introduction

Who is building video search engines?

Top text search engines: 5.6 billion searches (07/2006)

Introduction: Google

Introduction: Yahoo

Introduction: MSN/Live Search

Introduction: YouTube

Business Models

  Web Advertising: site volume, or keyword customized
  Video Ads: disable controls (MSN)
  Subscription: MLB, Real
  Download to own: iTunes
  Movie Rental: limited time, number of plays
  Other: Desktop Media Search, Media player (jukebox), Media Monitoring, Media Asset Management

Types of Video Sites

  Content Originators: Major Broadcasters; Affiliates, Local News; Major League Baseball
  Syndication, Aggregation, “Internet Broadcasters”: rental, purchase, advertising, subscription; MSN, Google, iTunes; ROO Media, FeedRoom
  Movie and Video Download
  Share portals: consumer content, blogs; YouTube, Putfile, Vsocial, Google, Akimbo
  Traditional Search Engines (Crawl) / “RSS”: Yahoo, Blinkx
  Other: Public (Internet Archive); Media Monitoring, asset management systems

Video Search Challenges

Current Video Search Engines

  Metadata
    File type and context
    Media file attributes: size, length
    Structured global metadata: RSS content description
  Content
    Content Indexing: search within a video; full text of dialog; image or video content
    Automated Content Indexing

Current Video Search Engines

Content Search Engines

Keyword search with transcripts from speech recognition

Content-Based Video Search Engine

Architecture

Content-Based Video Search Engine

Video Processing

Content-Based Video Search Engine

Research Challenges

  Speech Recognition
  Shot Boundary Detection
  Video Story Segmentation
  Concept Detection
  Multi-modal Fusion for Ranking: Text/ASR, Audio/Speech, Visual, etc.

Content-Based Retrieval

Our Research Problem

  Learning to rank video shots for automatic content-based search tasks

Challenges

  Multi-Modal Information Fusion
  Small Sample Learning (a few positive and no negative examples)
  Learning on large-scale datasets

Multi-modal and Multi-scale Ranking Framework

Main Ideas

  Representing video structures by graphs
  Using semi-supervised learning to address the small labeled sample learning problem
  Fusing multi-modal information by harmonic learning over graphs
  Multi-scale ranking for achieving efficient performance on large-scale datasets

Multi-modal and Multi-scale Ranking Framework

Graph-based Modeling

Figure: graph model of video structure with story, text, and shot nodes.

Multi-modal and Multi-scale Ranking Framework

Semi-Supervised Learning on Graph

  To find an optimal real-valued function g: V → R on the graph G
  To minimize a quadratic energy function over the graph
  Using the Gaussian field and harmonic property of spectral graph theory (J. Zhu’s ICML’03), a harmonic function g can be found.
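For reference, a sketch of the standard formulation used in the cited Gaussian-field work. The symbols here are assumptions rather than notation taken from the slide: w_ij is a symmetric, non-negative edge weight between nodes i and j, d_j is the degree of node j, L is the set of labeled nodes, and y_i is the label of node i.

```latex
% Quadratic energy, minimized subject to g matching the labels on L
E(g) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( g(i) - g(j) \bigr)^{2},
\qquad g(i) = y_i \quad \text{for } i \in L

% Harmonic property of the minimizer on unlabeled nodes
g(j) = \frac{1}{d_j} \sum_{i} w_{ij}\, g(i),
\qquad d_j = \sum_{i} w_{ij}, \quad j \notin L
```

Under these assumptions, the value at each unlabeled node is the weighted average of its neighbors' values, which is what makes the minimizer "harmonic".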

Multi-modal and Multi-scale Ranking Framework

Semi-Supervised Learning on Graph

Let

The solution of the harmonic function g can be expressed in matrix operations:
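A minimal sketch of the matrix-form solution, assuming the usual block partition into labeled (l) and unlabeled (u) nodes, with weight matrix W and diagonal degree matrix D, so that g_u = (D_uu - W_uu)^{-1} W_ul g_l. The function name, argument names, and toy graph are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, y_labeled):
    """Solve g_u = (D_uu - W_uu)^{-1} W_ul g_l on a weighted graph.

    W is a symmetric non-negative weight matrix; every unlabeled node is
    assumed to be connected (directly or indirectly) to a labeled node.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    y_labeled = np.asarray(y_labeled, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    # Graph Laplacian D - W, restricted to the unlabeled block.
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]

    # Solve the linear system instead of forming an explicit inverse.
    g_u = np.linalg.solve(L_uu, W_ul @ y_labeled)

    g = np.empty(n)
    g[labeled_idx] = y_labeled
    g[unlabeled_idx] = g_u
    return g

# Toy chain graph 0 - 1 - 2 - 3 with unit weights (hypothetical data):
# node 0 is a positive example (1.0), node 3 is a negative example (0.0).
W_chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
g_chain = harmonic_solution(W_chain, [0, 3], [1.0, 0.0])
```

On the toy chain, the scores of the two unlabeled nodes interpolate between the labeled endpoints, so nodes nearer the positive example rank higher.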

Multi-modal and Multi-scale Ranking Framework

Multi-Modal Fusion over Graph

  To combine text information into SSL on the visual modality, we consider the text inputs as the attached nodes on the visual graph:

  Visual - g

  Text - f
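One way to realize "attached nodes" is the external-classifier construction from the Gaussian-field literature: each unlabeled visual node gets one extra node carrying its text score f, reached with probability eta. The sketch below assumes that construction. The slides do not confirm that this exact form is used, and `eta`, the function name, and the toy data are hypothetical.

```python
import numpy as np

def harmonic_with_text_nodes(W, labeled_idx, y_labeled, text_scores, eta=0.2):
    """Harmonic solution g on the visual graph with attached text nodes.

    Assumed form: g_u = (I - (1-eta) P_uu)^{-1} ((1-eta) P_ul g_l + eta f_u),
    where P is the row-normalized weight matrix and f_u are text scores.
    Every node is assumed to have at least one edge (non-zero row sum).
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    y_labeled = np.asarray(y_labeled, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    P = W / W.sum(axis=1, keepdims=True)          # transition matrix
    P_uu = P[np.ix_(unlabeled_idx, unlabeled_idx)]
    P_ul = P[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.asarray(text_scores, dtype=float)[unlabeled_idx]

    A = np.eye(len(unlabeled_idx)) - (1.0 - eta) * P_uu
    b = (1.0 - eta) * (P_ul @ y_labeled) + eta * f_u
    g_u = np.linalg.solve(A, b)

    g = np.empty(n)
    g[labeled_idx] = y_labeled
    g[unlabeled_idx] = g_u
    return g

# Toy chain 0 - 1 - 2 - 3 (hypothetical): node 0 positive, node 3 negative.
W_demo = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
f_demo = [1.0, 0.1, 0.9, 0.0]   # made-up text relevance scores per node

g_visual_only = harmonic_with_text_nodes(W_demo, [0, 3], [1.0, 0.0], f_demo, eta=0.0)
g_text_only = harmonic_with_text_nodes(W_demo, [0, 3], [1.0, 0.0], f_demo, eta=1.0)
g_fused = harmonic_with_text_nodes(W_demo, [0, 3], [1.0, 0.0], f_demo, eta=0.3)
```

In this sketch `eta` trades off the two modalities: at 0 the result is the purely visual harmonic function, and at 1 each unlabeled node simply takes its text score.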

Multi-modal and Multi-scale Ranking Framework

Challenges

  Number of examples in database: N is large
  For example:
    TRECVID 2005: Rep. Key-Frames N = 45,765
    TRECVID 2006: Rep. Key-Frames N = 79,487
  How can semi-supervised learning be done at this scale?

Multi-modal and Multi-scale Ranking Framework

Multi-Scale Ranking

  Learning ranking through multi-scale reranking
  Each stage is associated with different computational costs
  In our solution, the four ranking stages are:
    1. Ranking by Text Retrieval using Language Models
    2. Re-ranking by NN fusing Text and Visual
    3. Re-ranking by SVM fusing Text and Visual
    4. Re-ranking by multi-modal Semi-supervised Learning
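The staged re-ranking idea can be sketched as a cascade in which each stage scores only the survivors of the cheaper stage before it. This is an illustrative skeleton under assumed names (`multi_scale_rerank`, the `Scorer` type, and the toy scorers are all hypothetical); the real stages would be the text, NN, SVM, and semi-supervised rankers listed above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer maps (query, candidate shot ids) to a score per candidate.
Scorer = Callable[[str, Sequence[str]], Dict[str, float]]

def multi_scale_rerank(query: str,
                       candidates: Sequence[str],
                       stages: List[Tuple[Scorer, int]]) -> List[str]:
    """Run a cascade of (scorer, keep) stages, cheapest scorer first.

    Each stage re-scores only the current candidates and keeps the top
    `keep` of them, so expensive scorers see progressively fewer shots.
    """
    current = list(candidates)
    for scorer, keep in stages:
        scores = scorer(query, current)
        current = sorted(current, key=lambda c: scores[c], reverse=True)[:keep]
    return current

# Toy stand-ins for a cheap text stage and a costlier fusion stage.
CHEAP = {"s1": 0.9, "s2": 0.8, "s3": 0.1, "s4": 0.7, "s5": 0.2}
COSTLY = {"s1": 0.2, "s2": 0.9, "s3": 0.8, "s4": 0.5, "s5": 0.7}

def cheap_text_scorer(query, shots):
    return {s: CHEAP[s] for s in shots}

def costly_fusion_scorer(query, shots):
    return {s: COSTLY[s] for s in shots}

top_shots = multi_scale_rerank(
    "hu jintao",
    ["s1", "s2", "s3", "s4", "s5"],
    stages=[(cheap_text_scorer, 3), (costly_fusion_scorer, 2)],
)
```

The design choice this illustrates is that a shot dropped by an early stage is never seen by later ones, so the cut-off at each stage trades recall for speed.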

Figure: multi-modal and multi-scale ranking pipeline. Raw video clips / streams go through video processing, text processing, and image processing to produce video stories and video shots. A user's query is ranked in stages (Text; Text + Visual NN; SVM/KLR as supervised ranking; SSR as semi-supervised ranking) with multi-modal fusion. Stage outputs are labeled top M related stories, then top N1, N2, N3, and N4 related shots, and the system returns the top K shots.

Benchmark Evaluations

Dataset

  TRECVID 2005
  Test: 140 video clips, 45,765 rep. key frames
  24 queries
  A query example:

<videoTopic num="0152">
  <textDescription text="Find shots of Hu Jintao, president of the People's Republic of China" />
</videoTopic>
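A query topic in the format shown above can be read with Python's standard XML parser. This is a small sketch assuming only the two elements and attributes visible in the example (`videoTopic`/`num` and `textDescription`/`text`); `parse_topic` is a hypothetical helper name, and real topic files may carry more fields.

```python
import xml.etree.ElementTree as ET

TOPIC_XML = """<videoTopic num="0152">
  <textDescription text="Find shots of Hu Jintao, president of the People's Republic of China" />
</videoTopic>"""

def parse_topic(xml_text):
    """Return (topic number, query text) from one videoTopic element."""
    root = ET.fromstring(xml_text)
    description = root.find("textDescription")
    return root.get("num"), description.get("text")

topic_id, query_text = parse_topic(TOPIC_XML)
```

The extracted `query_text` is what the first, text-only ranking stage would consume.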

Benchmark Evaluations

Text-only Retrieval

  No Pseudo-Relevance Feedback (No-PRF)
  With Pseudo-Relevance Feedback (PRF)

Figure: Evaluation of Language Models. MAP (axis 0 to 0.1), No-PRF vs. PRF, for TF-IDF, Okapi, KL-JM, KL-DIR, and KL-ABS.

Figure: Text-only Results. MAP (axis 0 to 0.1) for IBM, Columbia, TRECVID-Max, and CUHK.
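The labels KL-JM and KL-DIR presumably denote KL-divergence language-model retrieval with Jelinek-Mercer and Dirichlet smoothing (and KL-ABS absolute discounting); that reading is an assumption, since the slides do not expand the abbreviations. The sketch below scores documents by smoothed query log-likelihood, which gives the same ranking as KL-divergence scoring when the query model is the plain maximum-likelihood estimate. The parameter values and the toy transcripts are made up.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, method="jm", lam=0.5, mu=1000.0):
    """Smoothed log P(query | doc) for tokenized query/doc/collection lists.

    method="jm":  (1 - lam) * P_ml(w | doc) + lam * P(w | collection)
    method="dir": (tf(w, doc) + mu * P(w | collection)) / (|doc| + mu)
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0.0:
            continue  # term unseen in the whole collection: ignore it
        if method == "jm":
            p_doc = doc_tf[term] / doc_len if doc_len else 0.0
            p = (1.0 - lam) * p_doc + lam * p_coll
        elif method == "dir":
            p = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        else:
            raise ValueError("unknown smoothing method: " + method)
        score += math.log(p)
    return score

# Hypothetical speech transcripts for two video stories.
story_a = "hu jintao visits beijing".split()
story_b = "weather report for tuesday".split()
all_text = story_a + story_b
query = ["hu", "jintao"]

jm_a = query_log_likelihood(query, story_a, all_text, method="jm")
jm_b = query_log_likelihood(query, story_b, all_text, method="jm")
dir_a = query_log_likelihood(query, story_a, all_text, method="dir")
dir_b = query_log_likelihood(query, story_b, all_text, method="dir")
```

Both smoothing variants rank the story that mentions the query terms above the one that does not; they differ in how strongly document length and collection statistics influence the score.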

Benchmark Evaluations

Visual Features

  Color: Grid Color Moment, 3*3 grid, 81 dimensions
  Edge: Edge Direction Histogram, 36 bins + 1, 37 dimensions
  Texture: Gabor Moments, 5*8 = 40, 3 moments, 120 dimensions
  238 dimensions in total
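As an illustration of how a 3*3 grid can yield 81 color dimensions, the sketch below computes three moments per channel in each of the nine cells (9 cells * 3 channels * 3 moments = 81). The choice of mean, standard deviation, and cube-root skewness, and the color space of the input, are assumptions; the slides give only the grid size and the dimensionality.

```python
import numpy as np

def grid_color_moments(image, grid=3):
    """Return per-cell color moments for an H x W x C image array.

    For each grid cell and channel: mean, standard deviation, and the
    cube root of the third central moment (a signed skewness measure).
    """
    image = np.asarray(image, dtype=float)
    height, width, channels = image.shape
    ys = np.linspace(0, height, grid + 1).astype(int)   # row boundaries
    xs = np.linspace(0, width, grid + 1).astype(int)    # column boundaries
    features = []
    for i in range(grid):
        for j in range(grid):
            cell = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].reshape(-1, channels)
            mean = cell.mean(axis=0)
            std = cell.std(axis=0)
            skew = np.cbrt(((cell - mean) ** 3).mean(axis=0))
            features.extend([mean, std, skew])
    return np.concatenate(features)

rng = np.random.default_rng(0)
demo_vector = grid_color_moments(rng.random((60, 90, 3)))      # random test image
flat_vector = grid_color_moments(np.full((30, 30, 3), 0.5))    # uniform gray image
```

With the edge (37) and texture (120) features appended, such a vector would reach the 238 dimensions stated above.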

Figure: Normalized Comparison of GCM, EDH, Gabor, and GCM+Gabor+EDH features on COREL Benchmark Photos (y-axis 0 to 0.6, x-axis 0 to 120).

Benchmark Evaluations

Multi-modal Retrieval (Text + Visual)

  Text-only retrieval
  Text + NN (Text + Visual)
  Text + SVM (Text + Visual)
  MMMS (Text + Visual)

Benchmark Evaluations

Average Performance on TRECVID 2005 Dataset

Method     MAP     Num_Ret  Improvement
Text       0.0903  1669     0%
Text+NN    0.1034  1705     +14.51%
Text+SVM   0.1083  1764     +19.93%
MMMS       0.1157  1764     +28.13%

Evaluation Results
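For readers unfamiliar with the metric, here is a minimal sketch of non-interpolated average precision and its mean over queries (MAP), plus the relative-improvement arithmetic behind the table's last column. The slides do not state the exact evaluation protocol (for example, any cut-off on the result list), so this is the textbook definition, and the shot ids are made up.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant hit
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def relative_improvement(baseline, value):
    """Percentage gain of `value` over `baseline`, rounded to 2 decimals."""
    return round((value - baseline) / baseline * 100.0, 2)

# Two toy queries with hypothetical shot ids.
ap_first = average_precision(["a", "b", "c", "d"], {"a", "c"})
ap_second = average_precision(["x", "y"], {"y"})
toy_map = mean_average_precision([
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
])

# Improvement column recomputed from the MAP values in the table.
gain_nn = relative_improvement(0.0903, 0.1034)
gain_svm = relative_improvement(0.0903, 0.1083)
gain_mmms = relative_improvement(0.0903, 0.1157)
```

Recomputing the gains from the MAP column reproduces the reported +14.51%, +19.93%, and +28.13%.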

Benchmark Evaluations

Figure: Comparison with other approaches. MAP (axis 0.095 to 0.12), average performance of 24 queries, for IBM (T+V+M), CUHK-MMMS, Columbia (V+T+M), and IBM (V+T).

Related Work

  IBM solution: SVM + NN + Multiple Instance Learning
  Columbia solution: Information-Theoretical Clustering Approach
  CMU solution: Query-Class Dependent Weighting Ranking

Conclusion

  A tutorial of video search engines
  Research contributions
    A unified framework of Multi-Modal and Multi-Scale Ranking for video retrieval
    Graph-based modeling of video structures
    Semi-supervised learning for multimodal ranking
    Making SSL practical for large-scale problems
    Promising empirical results…

Future Work

Research is in progress; tough work lies ahead…

Any suggestions or comments?