January 19, 2006
Mike [email protected]
Carnegie Mellon University
Digital Video Research
Evaluation and User Studies with Respect to Video Summarization and Browsing
© Copyright 2006 Michael G. Christel
Talk Outline
• Summary of Informedia research work, 1994 – present
• Introduction to “ecological validity” and “usability”
• Overview of a series of Informedia user studies
• Lessons learned and conclusions
Note: this slide set was presented January 19, 2006 as the talk for an invited paper in the special session on “Evaluating Video Summarization, Browsing, and Retrieval Techniques” chaired by Ajay Divakaran, Mitsubishi Electric Research Labs; Multimedia Content Analysis, Management, and Retrieval 2006, Proceedings of SPIE Vol. 6073.
Informedia Video Understanding Research
• Initiated by the National Science Foundation, DARPA, and NASA under the Digital Libraries Initiative, 1994 to 1998
• Continued funding via NSF, DARPA, National Library of Medicine, Library of Congress, NASA, National Endowment for the Humanities, ARDA
• Over 10 terabytes of video data processed
• Includes ~7500 news video broadcasts segmented into 175,000+ stories, 3.2+ million shots
• News video from CNN, CCTV (Mandarin), LBC (Arabic); documentary video from the British Open University, QED Communications, NASA, the Discovery Channel, and other U.S. government agencies
• http://www.informedia.cs.cmu.edu/
Application of Diverse Technologies
• Speech understanding for automatically derived transcripts
• Image understanding for video “paragraphing”; face, text and object recognition
• Natural language processing for query expansion and content summarization
• Human computer interaction for video display, navigation and reuse
Integration overcomes limitations of each
Ecological Validity

Ecological validity – the extent to which the context of a user study matches the context of actual use of a system, such that
• it is reasonable to suppose that the results of the study are representative of actual usage, and
• the differences in context are unlikely to impact the conclusions drawn.

All factors of how the study is constructed must be considered: how representative are the tasks, the users, the context, and the computer systems?
Usability

Usability is the extent to which a computer system enables users to achieve specified goals in a given context of use effectively and efficiently while promoting feelings of satisfaction. (ISO 9241)
In the context of video shot-based retrieval (the TRECVID context), this becomes:

Usability is the extent to which a video retrieval system enables users to identify relevant shots from news video effectively and efficiently while promoting feelings of satisfaction.
TRECVID Evaluation Forum
• Investigate content-based information retrieval from digital video, with focus on the video shot as the information unit
• TRECVID conducted 2001–2005+ with growing corpora and participation
• 2004: 24 topics, 61.16 hours U.S. news (33,367 shots)
• 2005: 24 topics, 84.7 hours U.S., Arabic, and Chinese news (45,765 reference shots)
• Given a multimedia statement of the topic and the common shot boundary reference, return a ranked list of up to N shots from the reference which best satisfy the topic, where N=100 for 2002, N=1000 since 2003
• More information at http://www-nlpir.nist.gov/projects/trecvid/
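Ranked shot lists of this form are typically scored with average precision per topic, averaged over topics into mean average precision (MAP). A minimal sketch of average precision for one topic follows; the function and variable names are illustrative, not NIST's actual scoring code:

```python
def average_precision(ranked_shots, relevant_shots):
    """Average precision for one topic: the mean of precision@k
    taken at each rank k where a relevant shot appears."""
    relevant = set(relevant_shots)
    hits = 0
    precision_sum = 0.0
    for k, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

# A system submits up to N=1000 shots per topic (N=100 in 2002);
# MAP is this value averaged over all 24 topics.
```

Relevant shots ranked near the top contribute high precision@k values, so the measure rewards systems that surface good shots early in the list.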
User Study: Thumbnails for Video Segments
Text-based List for Video Segments
“Naïve” Thumbnail List (first shot’s image)
Query-based Thumbnail List (uses context)
Query-based Thumbnail Selection Process
1. Decompose video segment into shots.
2. Compute representative image for each shot.
3. Locate query scoring words (shown by arrows).
4. Use thumbnail image from highest scoring shot.
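The four steps above can be sketched as follows. This is a simplified illustration: the shot and transcript data structures are hypothetical stand-ins, and the real Informedia pipeline scores against speech-recognized transcripts aligned to the video:

```python
def pick_query_based_thumbnail(shots, query_terms):
    """Given shots as (representative_image, transcript_words) pairs,
    return the representative image of the shot whose transcript best
    matches the query (steps 1-4 of the selection process)."""
    query = {t.lower() for t in query_terms}

    def score(shot):
        _, words = shot
        # Count occurrences of query terms among the shot's words.
        return sum(1 for w in words if w.lower() in query)

    best = max(shots, key=score)   # highest scoring shot wins
    return best[0]                 # its representative image

# The "naive" treatment would instead always return shots[0][0],
# the first shot's image, regardless of the query.
```

The contrast with the naive treatment is the key design point: the thumbnail is chosen per query, so the same segment can show different images for different information needs.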
“Video Segment Thumbnail” Study Results

[Bar charts compare the Text, First-shot, and Query-based treatments on four measures: task time in seconds, score (max = 400), titles browsed, and subjective rating from 1 (terrible) to 9 (wonderful).]
Empirical Study Summary*
• Significant performance improvements for the query-based thumbnail treatment over the other two treatments
• Subjective satisfaction significantly greater for the query-based thumbnail treatment
• Subjects could not identify differences between thumbnail treatments, but their performance definitely showed differences!

*See INTERACT '97 conference paper by Christel, Winkler and Taylor for more details.
“Skim Video”: Presenting Significant Content
Empirical Study: Skims
Skim treatments (audio and image tracks):
• DFL – “default” long-stride
• DFS – “default” short-stride
• NEW – selective skim
• RND – same audio as NEW but with unsynchronized video
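As a rough illustration, the two “default” skims can be thought of as uniform subsampling of the source at different grain sizes, while NEW instead selects regions judged significant. The sketch below uses assumed parameter names and a made-up compaction factor, not the project's actual skim generator:

```python
def default_skim(frames, compaction=8, grain=1):
    """Uniformly subsample to roughly 1/compaction of the source,
    keeping contiguous runs of `grain` frames.  A long-stride skim
    (DFL) uses a large grain; a short-stride skim (DFS) a small one."""
    stride = compaction * grain   # distance between kept runs
    skim = []
    for start in range(0, len(frames), stride):
        skim.extend(frames[start:start + grain])
    return skim

frames = list(range(64))
dfl = default_skim(frames, compaction=8, grain=4)  # fewer, longer runs
dfs = default_skim(frames, compaction=8, grain=1)  # many brief samples
```

Both skims keep the same total fraction of the source; they differ only in whether that budget is spent on a few long excerpts or many choppy ones, which is exactly the grain-size issue raised in the study results below.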
Skim Study Results
• Subjects asked if an image was in the video just seen
• Subjects asked if text summarizes info. that would be in the full source video
[Charts show Images Correct (out of 10) and Phrases Correct (out of 15) for the RND, DFS, DFL, NEW, and FULL treatments.]
Skim Study Satisfaction Questionnaires
[Charts show 1–9 ratings (terrible–wonderful, frustrating–satisfying, dull–stimulating; poor–excellent video, poor–excellent audio) for the RND, DFS, DFL, NEW, and FULL treatments.]
Skim Study Results*

1996 “selective” skims performed no better than subsampled skims, but results from the 1997 study showed significant differences, with “selective” skims more satisfactory to users:
• audio is less choppy than in the earlier 1996 skim work
• synchronization with video is better preserved
• grain size has increased

*See CHI '98 conference paper by Christel, Smith, Winkler and Taylor for more details.
Evaluating Video Surrogates
• “Video surrogate” == set of text, image, audio, and video that can serve as a condensed representation for the full video document
• Evaluate via formal empirical studies as discussed in prior slides
• Other techniques used for evaluating and refining surrogates:
  • transaction logs
  • contextual inquiry
  • heuristic evaluation
  • cognitive walkthroughs
  • “think aloud” protocols
• See ACM Multimedia 2004 conference paper “Finding the Right Shots: Assessing Usability and Performance of a Digital Video Library Interface” for discussion on using such techniques with TRECVID tasks
Moving from Summarization to Browsing
• Initial Informedia work developed video surrogates to help the user gauge precision: does this video fit my need?
• Traditional video query returns a list of video surrogates, but locating meaningful info in the list may be problematic:
  • too much information is returned
  • the list view neither communicates the meaning of the list as a whole nor the multiple relationships between items in the list
  • different users have different information needs
• Later Informedia work developed information visualizations – video collages – to help the user improve recall and address these problems
Timeline Collage for “James Jeffords”, 2001
Timeline Collage, Jeffords, just May 2001
Evaluating Video Collages
• First step: Is the information content in the collage reasonable?
• Compare text in the collage to text in other sources (e.g., Google pages) and “truth” (e.g., biography web pages) to determine precision and recall
• Refer to ACM Multimedia 2002 paper for details
• What’s missing: measures of end-user acceptance/satisfaction, efficiency, and effectiveness, i.e., the ecological validity of users interacting with the video collage
• Within-subjects study conducted with 20 university students using four versions of timeline collages
  • With and without text lists beneath timeline area
  • With and without thumbnail imagery in timeline area
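The first-step content check can be framed as set-based precision and recall: compare the terms appearing in the collage against terms drawn from a “truth” source such as a biography page. A sketch under that framing (term extraction in the actual study was more involved than simple word sets):

```python
def precision_recall(collage_terms, truth_terms):
    """Precision: fraction of collage terms found in the truth set.
    Recall: fraction of truth terms the collage covers."""
    collage = {t.lower() for t in collage_terms}
    truth = {t.lower() for t in truth_terms}
    overlap = collage & truth
    precision = len(overlap) / len(collage) if collage else 0.0
    recall = len(overlap) / len(truth) if truth else 0.0
    return precision, recall
```

High precision means the collage avoids spurious text; high recall means it covers the facts a biography would mention. The study's finding that image-bearing collages had lower precision refers to this kind of measure.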
Choosing the Task and Initiating the Study
• Task was chosen to represent the broad fact-gathering work supported by information visualization interfaces
• Prior work with students and a digital video library showed that assignments frequently centered on:
  • assembling answers to “who,” “what,” “when,” “where,” “why”
  • creating visually appealing cover pages communicating the main theme of a report
• The task was defined as completing a celebrity report for 2001 “people in the news” drawn from a test set of 24
• Usability metrics:
  • Efficiency (the time to complete the celebrity report)
  • Effectiveness (automatic and manual grading of the report)
  • Satisfaction (questionnaires and user rankings)
Celebrity Report Template

Celebrity report, filled in for “Jane Swift”
Video Collage Study Efficiency Results
Video Collage Study: Effectiveness Results
Collage Study, Automatic IR Measures
Collage Study, Satisfaction + Discussion
• Collages with imagery had lower precision and worse efficiency than collages without imagery, yet collages with imagery were favored
• Improvements suggested by the study: “brushing” text to integrate it with the rest of the collage, better image selection and layout
Difficulty of Evaluating Video Collages
• Ecological validity: would a news video corpus be consulted to learn details about a celebrity?
• Establishing a control, e.g., perhaps the top-rated news story mentioning the celebrity rather than a biography web page
• Is the automated processing tuned to satisfy the experimental task and not real-world tasks?
• Is the input video data tuned to guarantee success with certain tasks?
• One means of addressing these difficulties: have a community-wide forum for evaluating video retrieval interfaces and determining ecological validity: TRECVID
Benefits offered by the TRECVID benchmark
• Topics are defined by NIST to reflect many of the sorts of queries real users pose
• The data set is real and representative
• The processing efforts are well communicated, with a set of rules for all to follow
• Remaining question of validity: does the subject pool represent a broader set of users, with university students and staff for the most part comprising the subject pool for many TRECVID research groups?
TRECVID for Evaluation Work
• TRECVID provides a public corpus with shared metadata to international researchers, allowing for metrics-based evaluations and repeatable experiments
• An evaluation risk of over-relying on TRECVID is tailoring interface work to deal solely with the genre of video in the TRECVID corpus, e.g., U.S. news (2004 TRECVID corpus)
  • This risk is mitigated by varying the TRECVID corpus
• Another risk: topics and corpus drifting from being representative of real user communities and their tasks
• Exploratory browsing interface capabilities supported by video collages and other information visualization techniques are not evaluated via the IR-influenced TRECVID
Reflections on Informedia User Study Work
• Efficiency, effectiveness, and satisfaction are three important HCI metrics; overlooking any of them reduces the impact of the user study
• A mix of qualitative (observation, think-aloud protocols) and quantitative (transaction logs, click stream analysis, task times) metrics is useful, with quantitative data confirming observations and qualitative data helping to explain why
• Discount usability techniques definitely offer benefit, both to iteratively improve video retrieval interfaces before committing to a more formal empirical study, and to confirm that changes put in place as a result of a study had their intended effects
Informedia User Study Reflections, contd.
• Informedia interface work endeavors to leverage the intelligence of the user to compensate for deficiencies in automated content-based indexing
• Goals for future Informedia interface evaluation work with greater impact:
  • Dealing with additional genres beyond news and documentaries
  • Experimenting with browsing tasks in addition to retrieval
  • Pursuing longitudinal studies with populations other than just university students
Challenges for Video Summarization/Browsing
• Addressing the semantic gap between low-level features and high-level user information needs for video retrieval, especially when the corpus is not well structured and does not contain narration audio
• Demonstrating that techniques from the computer vision community scale to materials outside of the researchers’ particular test sets, and that information visualization techniques apply more generally beyond a tested experimental task
• Leveraging the intelligence and goals of human users: be pulled by user-driven requirements rather than just pushing technology-driven solutions
Credits

Many members of the Informedia Project and CMU research community contributed to this work; a partial list appears here:

Project Director: Howard Wactlar
User Interface: Mike Christel, Ron Conescu, Chang Huang, Neema Moraveji, Adrienne Warmack
Image Processing: Takeo Kanade, Yuichi Nakamura, Norman Papernick, Shin’ichi Satoh, Henry Schneiderman, Michael Smith
Speech and Language Processing: Alex Hauptmann, Dorbin Ng, Ricky Houghton, Rong Jin, Michael Witbrock
Machine Learning/Multimedia Classification: Robert Chen, Wei-Hao Lin, Jun Yang, Rong Yan
Informedia Library Essentials: Bob Baron, Colleen Everett, Melissa Keaton, Bryan Maher, Scott Stevens