
Page 1: The TREC-10 Video Track: Overview

Alan F. Smeaton
Centre for Digital Video Processing
Dublin City University

Page 2: Video track session organisation

45-minute overview talk covering:
  – brief introduction/overview to digital video
  – description of the tasks in the video track, including the difficulties/surprises encountered
  – brief bird’s eye view of who-did-what

3 x 20-minute presentations from 3 groups:
  – Carnegie Mellon University (Informedia)
  – IBM T. J. Watson Research Center
  – CWI, TNO, U. of Amsterdam and University of Twente … the “Lowlands”

Page 3: Introduction to Digital Video Encoding

– Video is 25/30 fps of synchronised images and audio;

– To display a single image of TV-quality video requires about 720 Kbytes, so without compression a 90-minute movie is roughly 100 GBytes -> video must be compressed! (A quick check of this arithmetic follows this list.)

– There are formats such as .AVI, QuickTime, .ram, .rm (Real Networks), .wma, .wmp, but the ones that matter are the MPEG family;

– Before we look at IR on video we should have some understanding of how video is encoded;
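A quick sanity check of that figure, taking the slide's rough 720 Kbytes per uncompressed TV-quality frame and 25 fps for a 90-minute movie (a sketch only; the exact per-frame size depends on resolution and colour depth):

```python
# Rough check of the uncompressed-video arithmetic quoted above.
BYTES_PER_FRAME = 720 * 1024      # the slide's ~720 KB per uncompressed TV-quality frame
FPS = 25                          # PAL frame rate
MOVIE_SECONDS = 90 * 60           # a 90-minute movie

total_frames = FPS * MOVIE_SECONDS               # 135,000 frames
total_bytes = total_frames * BYTES_PER_FRAME
print(f"{total_bytes / 1024**3:.1f} GBytes uncompressed")   # ~92.7 GBytes, i.e. on the order of 100 GBytes
```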

Page 4: Video encoding principles

– All video encoding standards use motion compensation, identifying motion between adjacent frames and transmitting only the differences … except across shot boundaries;

– Doing this on pixels is too fine-grained because cameras boom, tilt, pan, zoom, shake, and objects move, so frames are divided into pixel aggregates called “blocks” and motion compensation is computed between equivalent blocks (a minimal block-matching sketch follows this list);

– This allows a graceful and effective encoding of deliberate camera and object motion;
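To make the block idea concrete, here is a minimal exhaustive block-matching sketch in Python; the frame size, block size and search range are illustrative assumptions, not any particular encoder's parameters:

```python
import numpy as np

def best_motion_vector(prev_frame, cur_frame, bx, by, block=16, search=8):
    """Exhaustive block matching: find the (dx, dy) offset into prev_frame whose
    block best matches the block at (bx, by) in cur_frame, using the sum of
    absolute differences (SAD). Illustrative only, not a real encoder."""
    target = cur_frame[by:by + block, bx:bx + block].astype(int)
    h, w = prev_frame.shape
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue
            candidate = prev_frame[y:y + block, x:x + block].astype(int)
            sad = int(np.abs(target - candidate).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# Two synthetic greyscale frames where the picture has shifted 3 pixels to the right:
prev = np.random.randint(0, 256, (288, 352), dtype=np.uint8)
cur = np.roll(prev, shift=3, axis=1)
print(best_motion_vector(prev, cur, bx=160, by=144))   # ((-3, 0), 0): block found 3 pixels to the left
```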

Page 5: Evolution of MPEG standards

– MPEG-1 (’93) ... for 1.5 Mbit/s transmission, near VHS-quality. Encoding based on macro-blocks;

– MPEG-2 (’95) ... allows data rates up to 100 MBit/s. Used for digital TV, DVDs and broadcasting units;

– MPEG-4 (’01) ... suitable for lower data rates between 10 KBit/s and 1 MBit/s. Good quality object-based encoding that will enable object-based interactions for users, but out of reach for now;

– MPEG-7 is under development; it will basically be an XML-like content descriptor stream;

Page 6: MPEG-1 encoding

– 352x288 pixels per frame @ 25 fps giving near-VHS quality at 1.5 Mbps; can be decoded on an (old) PC, encoding requires hardware;

– MPEG stream has I-, P- and B-frames in a given pattern;

– An I-frame is essentially a JPEG-encoded image; each frame is divided into 16x16-pixel macroblocks and, in B- and P-frames, equivalent macroblocks are compared and a motion vector is generated if possible;

– A typical frame pattern: I-frame ((B-frame)*, P-frame, (B-frame)*)* I-frame
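A toy illustration of such a pattern and of the macroblock arithmetic (the 12-frame group with two B-frames between anchor frames is just one common choice, as the next slide notes):

```python
def gop_pattern(n_frames, gop_size=12, b_between=2):
    """Generate an illustrative I/P/B frame-type sequence: an I-frame every
    gop_size frames, with b_between B-frames between successive anchor (I/P)
    frames. Real encoders choose, and may vary, these parameters themselves."""
    frame_types = []
    for i in range(n_frames):
        pos = i % gop_size
        if pos == 0:
            frame_types.append("I")
        elif pos % (b_between + 1) == 0:
            frame_types.append("P")
        else:
            frame_types.append("B")
    return "".join(frame_types)

print(gop_pattern(24))                 # IBBPBBPBBPBBIBBPBBPBBPBB
# A 352x288 MPEG-1 frame divides into (352/16) x (288/16) = 22 x 18 macroblocks:
print((352 // 16) * (288 // 16))       # 396
```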

Page 7: MPEG-1 encoding (continued)

– I-, P- and B-frames form a pattern depending on the encoder used … ours has an I-frame every 12 frames (2 per second) but it does not have to be like this;

– Encoders are not perfect and the 396 motion vectors in a frame (one per 16x16 macroblock: (352/16) x (288/16) = 22 x 18 = 396) can sometimes be incorrect, giving rogue vectors;

– MPEG-2 is the same principle except 720x576 pixels and is used for digital TV;

– MPEG-4 is object based compression, based on identifying, tracking and encoding object layers which are rendered on top of each other, with huge potential for interaction;


Page 8: What’s great about MPEG?

– MPEG/video standards are great for … recording, authoring, compression, transmission, playback;

– MPEG/video standards do nothing for … searching, browsing, linking, summarising;

Page 9: Who needs video searching?

– Journalists, producers, film & TV program makers need to search;

– The BBC archive gets 500k+ queries plus 1M new items … per year;

– From the BBC …
  – Police car with blue light flashing
  – Government plan to improve reading standards
  – Two shot of Kenneth Clarke and William Hague
  – Bullying at school
  – X-ray machines at airports
  – Failing schools
  – UN peace keeping forces in Angola
  – Interior of UN Security Council
  – Actuality of UN General Secretary Kofi Annan
  – Exteriors of commercial

Page 10: Who else needs video searching?

– This can be done with keyword captions and indices;

– However, the development of video navigation is not necessarily about replacing or improving existing applications … it’s about creating new ones;

– Video content is plentiful … it’s now available digitally … we can work on it directly … so it follows that we should;

– The video track set out to promote progress in content-based retrieval directly from digital video;

Page 11: Video track ≠ SDR track

– For any binary “blob”, we can assign captions;
– If the blob has an audio track, we could do speech recognition -> the SDR (spoken document retrieval) track;

– From the outset we didn’t want this to be just “SDR with pictures”, though, in deployment, video asset management benefits from SDR;

– We want this track to explore and progress content access on the video aspects, of which SDR is a part;

Page 12: So what can be done for video “navigation”?

– Video “programmes” are structured into logical scenes, and physical shots

– If dealing with text, then the structure is obvious: – paragraph, section, topic, page, etc.

– All text-based indexing, retrieval, linking, etc. builds upon this structure;

– If dealing with video, then first it needs to be structured, automatically;

– Automatic shot boundary detection and selection of representative keyframes is usually the first step;
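A bare-bones sketch of that first step: adjacent-frame histogram differencing for hard cuts, with the middle frame of each shot taken as its keyframe. The bin count and threshold are arbitrary assumptions, and real systems (especially for gradual transitions) are far more involved:

```python
import numpy as np

def detect_cuts(frames, bins=64, threshold=0.4):
    """Flag a hard cut wherever the L1 distance between successive normalised
    greyscale histograms exceeds a threshold. Purely illustrative; gradual
    transitions need more than adjacent-frame differencing."""
    cuts, prev_hist = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hist = hist / hist.sum()
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(i)            # a cut between frame i-1 and frame i
        prev_hist = hist
    return cuts

def keyframes(n_frames, cuts):
    """Take the middle frame of each detected shot as its representative keyframe."""
    bounds = [0] + cuts + [n_frames]
    return [(start + end - 1) // 2 for start, end in zip(bounds, bounds[1:])]
```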

Page 13: Typical automatic structuring of video

[Figure: a video document → a set of keyframes → keyframe browser combined with transcript or object-based search]

Page 14: TREC Video Track

– Participating groups were invited to test their work in one or more of the following:
  • Shot boundary detection
  • Automatic or interactive search for known items
  • Automatic or interactive search for general search topics

Page 15: Data

– 11.2 hours of MPEG-1 encoded video from:
  – NIST DVD of US Govt. videos
  – A range of heterogeneous material from the Open Video Project (UNC)
  – BBC stock footage

– Common characteristic … free for us, or in the public domain;

– Getting test data is really difficult and there were many cul-de-sacs;

Page 16: Shot Boundary Detection task

– A non-trivial task because of fades and dissolves, though cuts are easier;

– SBD used a subset (half) of the dataset, containing 3,006 shot transitions; 2/3 were hard cuts and 1/3 were gradual transitions;

– Submissions were compared to manual reference data by matching (video file, frame number) pairs, calculating the number of inserted and deleted gradual transitions, the correction rate, deletion rate, insertion rate, error rate, quality index, correction probability, recall, and precision (a sketch of the simplest of these follows this list);

– Needless to say, these measures were inherited from previous outside work;
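For the most familiar of those measures, a sketch using the usual definitions (assumed here; the exact formulations the track inherited from earlier SBD evaluation work may differ):

```python
def sbd_measures(n_reference, n_submitted, n_matched):
    """Recall, precision and insertion/deletion counts for detected transitions,
    under the usual (assumed) definitions: deletions are reference transitions
    with no matching submission, insertions are submissions matching nothing."""
    deleted = n_reference - n_matched
    inserted = n_submitted - n_matched
    recall = n_matched / n_reference if n_reference else 0.0
    precision = n_matched / n_submitted if n_submitted else 0.0
    return {"recall": recall, "precision": precision,
            "inserted": inserted, "deleted": deleted}

# e.g. a hypothetical run over the 3,006 reference transitions:
print(sbd_measures(n_reference=3006, n_submitted=3200, n_matched=2850))
```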

Page 17: Shot Boundary Detection task (continued)

– Evaluation of SBD sounds easy, but that’s where the fun started !

– Different MPEG-1 decoders produce different frame numbering from the same source;

– Decoders suddenly jump a few frames … in mid-file, so we had to be flexible in defining matches for evaluation:

[Figure: a reference transition and a submitted transition, shown for a hard cut and for a gradual transition, with their frame overlap marked]

Matches if the overlap is at least 1/3 of the longer transition and 1/2 of the shorter;
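Reading that rule as a constraint on the frames the two transitions have in common, a sketch of the flexible match might look like this (an interpretation of the slide's figure, not the official evaluation code):

```python
def transitions_match(ref, sub):
    """Decide whether a submitted transition matches a reference one: their
    overlap must cover at least 1/3 of the longer of the two intervals and
    1/2 of the shorter (one reading of the rule above). Intervals are
    (first_frame, last_frame), inclusive."""
    overlap = min(ref[1], sub[1]) - max(ref[0], sub[0]) + 1
    if overlap <= 0:
        return False
    len_ref = ref[1] - ref[0] + 1
    len_sub = sub[1] - sub[0] + 1
    longer, shorter = max(len_ref, len_sub), min(len_ref, len_sub)
    return overlap >= longer / 3 and overlap >= shorter / 2

print(transitions_match((100, 129), (110, 127)))   # True: 18 shared frames
```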

Page 18: Shot Boundary Detection task (continued)

– Why SBD?

– It is a fundamental primitive of most/all work in content-based video navigation

– It provides a stepping-stone for those who want to break into this area

Page 19: Search task

– We wanted to make this multimedia IR, so topics were rich formulations of an information need;

– Topics were formulated by 12 participating groups, at least 5 each, and NIST did some pruning;

– Topics include a text description of information need plus video clips, still images and/or audio illustrating what is needed;

– Of the 74 topics, 2/3 had video examples (2 or 3 clips each), 1/3 had still images (2 each) and 10 had audio, all taken from outside the test data;

– Half the topics are “known item” searches, where topics included the KIs, half are unknown or “general”;

Page 20: Known items search task

Evaluation of the video search task is like QA (question answering):

[Figure: a mock ranked list of retrieved clips, evaluated by Precision / Recall]

… except it’s more difficult ’cos we are dealing with moving targets of variable size, and inexact matching;

Page 21: Known items search task

– We’ve assumed frame-level accuracy in submitting clips and defining ground truth, but we’ve parameterised the matching;

[Figure: a submitted clip overlapping a known item, with the intersection marked]

KI coverage is how much of the known item was found; RI coverage is how much of the submitted clip was on target;
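A sketch of those two coverages on frame intervals, with a threshold test like the one described on the next slide; the interval representation, the function names, and which threshold applies to which coverage are assumptions:

```python
def coverages(known_item, submitted):
    """KI coverage: fraction of the known item covered by the submitted clip.
    RI coverage: fraction of the submitted clip lying inside the known item.
    Clips are (first_frame, last_frame), inclusive. A sketch only."""
    overlap = min(known_item[1], submitted[1]) - max(known_item[0], submitted[0]) + 1
    overlap = max(overlap, 0)
    ki_len = known_item[1] - known_item[0] + 1
    sub_len = submitted[1] - submitted[0] + 1
    return overlap / ki_len, overlap / sub_len

def is_hit(known_item, submitted, ki_min=0.66, ri_min=0.33):
    """Count the clip as finding the known item if both coverages reach their
    thresholds (0.66/0.33 is the combination reported in the notebook; which
    threshold goes with which coverage is assumed here)."""
    ki_cov, ri_cov = coverages(known_item, submitted)
    return ki_cov >= ki_min and ri_cov >= ri_min

print(coverages((1000, 1099), (1050, 1199)))   # (0.5, 0.333...): half the KI found, a third of the clip on target
```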

Page 22: Known items search task (continued)

– We use pre-defined threshold settings for KI / RI coverage of 0.33 and 0.66;

– We’ve calculated Precision and Recall for 4 combinations of these, reporting 0.66/0.33 in the notebook;

– Paul’s analysis shows stability of submitted runs across different parameter combinations;

– Lesson learned … we could use pre-defined shots with bounds for retrieval and evaluation, but that’s not real life;

Page 23: General items search task

– Same as known item, except evaluation by NIST information analysts;

– Two sets of judgements were completed, with 84.6% agreement!

– The evaluation measure calculated was precision, but there is ongoing work on calculating a kind of recall;

– Lessons learned … topics are very rich, and the relationship between the text and the video/image/audio examples is not always obvious;

Page 24: Who did what …

– Carnegie Mellon University
– CLIPS IMAG Grenoble
– Dublin City University
– Fudan University
– Glasgow University
– IBM Almaden and T.J. Watson
– Imperial College
– Johns Hopkins University
– Lowlands Group (UvA, Twente, CWI, TNO)
– Microsoft Research China
– University of Maryland
– University of North Texas

Page 25: Who did what …

1. Carnegie Mellon University (US)

Search: with, and without, the Sphinx speech recognition system, both automatic and interactive searches; minor changes to the Informedia system; used colour histogram matching, texture, video OCR, face detection and speech recognition;


Page 26: Who did what …

2. CLIPS IMAG Grenoble (Fr)

SBD: where there is significant motion between adjacent frames, uses motion compensation based on optical flow, plus a photo-flash detector and a dissolve detector;


Page 27: Who did what …

3. Dublin City University (Irl)

SBD: some work on macroblock patterns, but only on a partial dataset;

Search: interactive, to evaluate the effectiveness of 3 different keyframe browsers … timeline, slideshow, hierarchical … used 30 real users, each doing 12 topics across the 3 browsers;


Page 28: Who did what …

4. Fudan University (China)

SBD: used frame differences based on luminance and colour histograms;

Search: for 17 topics, calculated camera motion, face detection and recognition, video text and OCR, speaker recognition and clustering, speech recognition and speaker gender detection;


Page 29: Who did what …

5. Glasgow University (UK)

SBD: examining the frequency of occurrence of macroblock types in compressed files; the technique is not tuned to gradual transitions;


Page 30: Who did what …

6. IBM Groups Almaden and T.J. Watson (US)

SBD: used the IBM CueVideo toolkit;

Search: with, and without, speech recognition, automatic and interactive searching tasks; based on the semi-automatic construction of models for different kinds of scenes, events and objects; extensive experiments;


Page 31: Who did what …

7. Imperial College (UK)

SBD: used colour histograms, but comparing across a range of frame distances instead of only the usual adjacent frames;


Page 32: Who did what …

8. Johns Hopkins University (US)

SBD: based on colour histogram and luminance;

Search: treated video as a sequence of still images and used colour histograms and texture to match query images and topic video keyframes vs. video data keyframes; no processing of text or audio; no previous video experience;


Page 33: Who did what …

9. Lowlands Group (NL)

Search: both automatic and interactive searching, used output from CMU speech processing plus recognition of video text via OCR, detector for the number of faces on-screen, camera motion (pan, tilt, zoom), scene detectors, and models of lazy, interactive users;


Page 34: Who did what …

10. Microsoft Research Asia (China)

SBD: working on uncompressed video, with 2 techniques, one for hard cuts and one for gradual transitions, integrated together; a very elaborate SBD technique;


Page 35: Who did what …

11. University of Maryland (US)

SBD: based on examining macroblock and DCT coefficients;

Search: a temporal colour correlogram … a colour histogram that also captures the spatio-temporal arrangement of colours … is used to retrieve automatically from the video, using the topic examples;
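As a rough illustration of the correlogram idea (not this group's implementation): a much-simplified auto-correlogram over one greyscale frame and a single pixel distance; the temporal version extends the same counting across neighbouring frames:

```python
import numpy as np

def autocorrelogram(frame, n_colors=8, distance=1):
    """Very simplified colour auto-correlogram: for each quantised colour c,
    the probability that a pixel `distance` pixels to the right of a colour-c
    pixel also has colour c. Illustrative sketch only."""
    q = (frame.astype(int) * n_colors) // 256          # quantise grey levels to n_colors bins
    left, right = q[:, :-distance], q[:, distance:]    # horizontally offset pixel pairs
    corr = np.zeros(n_colors)
    for c in range(n_colors):
        mask = left == c
        if mask.any():
            corr[c] = float(np.mean(right[mask] == c))
    return corr
```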


Page 36: Who did what …

12. University of North Texas (US)

Search: did 13 of the general search topics; used a keyframe extractor and an image retrieval tool to match topics which had exemplar video or images;


Page 37: Operational comment

– All data exchange, topic formulation, submission of runs and the evaluation software used XML, which involved the track participants defining DTDs, and some people provided stylesheets … thanks everyone;

Page 38: Outcomes and results

– The whole track framework was … different … to previous TREC tracks … many surprises and issues, some of which were resolved;

– We may go on to identify other primitives such as camera motion, overlaid text, faces and sound types;

– We can’t do many head-to-head comparisons across participants; CMU, IBM & the Lowlands group all used similar detectors, and speech, but we can’t use aggregate numbers;

– We’ve gone into new territory here; we may not have got many answers this year except from within groups, but we’ve laid the groundwork;

Page 39: Acknowledgements & thanks

– for video data:
  – Richard Wright at BBC for stockshot footage
  – Gary Marchionini and the Open Video Project
  – NIST for the DVD of NIST materials

– to those on the email list:
  – for bursts of enthusiastic emails on many issues over the last year, and for the topics
  – Georges Quénot for placing work on SBD evaluation by the ISIS/GT10 research group at our disposal

– to participants … for doing it, with gusto

– to Paul Over and Ramazan Taban for making it happen