



Creating Video Summaries

Amy Pavel
CS294-10 Final Project
[email protected]

ABSTRACT
Many presentation videos such as lectures, keynotes and TED talks are available online. Viewers often navigate the content of the presentation video by manually scrubbing on the video timeline. Recently, some presentation authors have summarized their presentation videos to allow users to skim and browse the content of the video. Users can read the video summary captions or click the portion they are interested in to view more detail. However, the authors currently do this by hand with only the raw video footage. This requires manually scrubbing, annotating relevant content, writing summaries and selecting keyframes. For this project I designed an interface that links the video and transcript to help authors create skimmable video summaries. I also investigated methods for selecting key frames and summarizing transcript text.

Author Keywords
Video summaries; text summaries; visualization

INTRODUCTION
Presentations are often given in person to an audience with a rehearsed script and visual aids. To increase the availability of the presentation material, presentation authors or other organizers often publish the presentation content online. As streaming and downloading speeds improve and the popularity of online education increases, we are seeing an increasing number of these videos online. For example, the TED conference series has over 1500 published presentation videos and the edX site now includes 110 courses worth of lecture videos [2, 5].

Watching an entire presentation video takes around 20-90 minutes, which is time consuming. If a user would rather skim the video or is only interested in a short segment, they usually either read a short caption or manually scrub through the available content. This could happen when the viewer is revisiting a lecture video and wishes to re-watch only a portion that s/he did not recall. Or, the user could be referencing a new talk in which only a small portion is relevant: for instance, a lecture video where only a portion pertained to the user’s research, or a company product release where only a few of the products discussed are useful to the user.

Submitted to CS294-10. Class final project.

Figure 1. An example video with an indexable summary. (a) shows the video player, which can be scrubbed to index the video at a point manually. The current method for viewing and indexing videos relies on this type of video player. (b) The video summary includes a thumbnail and caption for each summarized portion of the video. By clicking on any thumbnail or caption, the video will begin playing at the beginning of the summarized portion so that the user can gain more detail.


Some presentation authors have published videos of their presentations accompanied by video summaries with captions and thumbnails [26, 9]. In these examples the user can skim the text summary and thumbnails to get an overview of the presentation content. One example allows the user to use the summary to index into the corresponding portion of the video. This way the user can find more detail for interesting portions [26].



To make such a video summary, the author manually chooses the layout for their content, writes a summary for each portion, scrubs through the video to find the corresponding timestamps, selects keyframes for each segment and assembles the final summary with HTML.

My goal is to make it easier for authors and third parties to author and publish video summaries. Using the video transcript, authors and third-party users can skim through the video and summarize the text with the Summarizing Interface. With the video linked to the transcript, they can quickly view corresponding video segments. The system will automatically select thumbnails for summary lines, which users can tweak to their liking. When the user is done summarizing a video, they can use the Summary Viewer interface to view and share the final summary, which allows users to skim or index into the video as pictured in Figure 1.

Because the final goal is minimizing the work required by the summary author, I will also discuss possible methods for automating pieces of the summarization process.

RELATED WORK
In this project I am working with videos that also include the transcript, with the ultimate goal of partially automating the summary process. There are three main areas of related work: video summarization, automatic text summarization, and recent interfaces for viewing video summaries.

Video Summaries
Money et al. define a video summary as a succinct representation of a video through a combination of still images, video segments, graphical representations and textual descriptors [19]. Many automated video summaries relied primarily on image and audio information in the video alone to extract key frames and important segments, but this was challenging because the representation in video is complex.

Several recent techniques incorporate textual data to improve performance [14, 24, 29, 22]. Li et al. use existing text data to help train a model to perform TV series summarization where new summaries do not need text data [14]. Others use manual or automatically obtained transcripts to aid in creating video summaries [24, 29, 22].

In particular, Smith and Kanade explored video skimming and characterization through image and language understanding techniques using manual text transcripts [22]. Given the video and the corresponding transcript, they produce a video skim that consists of keyphrases and corresponding keyframes to go along with the phrases. They used object recognition, shot detection, audio processing and TF-IDF weights to create the summaries from these inputs. However, their skims do not read fluently, as they are partial sentences taken from the transcripts. In addition, the algorithm was mainly applied to feature films with many visually different shots, and the user cannot index into the original video using the resulting summary.

Some work has focused particularly on summarizing presentation videos [12, 11]. VAST MM was a presentation video summary browser based on exposing key words and shot detection areas [12]. Auto-summarization for video-audio presentations exploits the audio stream and other users’ access data to generate video summaries that are 25% of the full length presentation [11]. However, both of these require viewing the video, as limited skimmable text is available for understanding the content.

Automatic Text Summarization
Automatic text summarization has been heavily studied in the NLP community [15]. Extractive summaries are summaries that consist of language found within the text. For instance, an extractive summary of a news article would be an ordered list of sentences that are by some metric “important” to that news article. Abstractive summaries are more closely related to the types of summaries that humans create from articles. We create these summaries by grouping concepts and sentences within the document to create a more information dense representation of the article [15].

I will talk about possible methods for producing abstractive summaries via crowdsourcing in the Future Work portion of this paper. Here I will focus on related work pertaining to extractive summaries.

Topic Models
Topic words [21] are a promising avenue for finding important sentences within each paragraph. This method works by finding topic words through comparing topic documents to background documents, and then computing tf*idf within the document (in our case the transcript) considering only the topic words. The score is then used to pick the highest scoring sentences to represent the document. Probabilistic topic models instead use Latent Semantic Analysis, EM or Bayesian methods [6, 23, 15].
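As a rough illustration of this scoring scheme, the sketch below selects topic words by comparing the transcript against a background collection and then scores each sentence by how many topic words it contains. It is a minimal stand-in for the method in [21]; the tokenization, smoothing, and scoring function are simplifications chosen for illustration, not part of the original work.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def topic_words(transcript, background_docs, top_k=20):
    """Words much more frequent in the transcript than in a background
    collection (a simplified topic-word test)."""
    topic_counts = Counter(tokenize(transcript))
    background_counts = Counter(w for d in background_docs for w in tokenize(d))
    n_topic = sum(topic_counts.values())
    n_back = sum(background_counts.values()) or 1
    score = {
        w: (c / n_topic) * math.log((c / n_topic) / ((background_counts[w] + 1) / n_back))
        for w, c in topic_counts.items()
    }
    return {w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]}

def top_sentences(paragraph, topics, n=1):
    """Pick the sentences of a paragraph containing the most topic words."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    ranked = sorted(sentences, key=lambda s: -sum(w in topics for w in tokenize(s)))
    return ranked[:n]
```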

Graph Based Methods
Graph based methods work by defining some similarity metric between units (for instance, tf*idf similarity between sentences) and then finding either a single central node or central nodes within clusters [13].
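The sketch below illustrates the graph view with a simple degree-centrality variant: sentences are nodes, edges connect sentences whose bag-of-words cosine similarity exceeds a threshold, and the most connected sentence is returned. The threshold and the use of raw term frequencies (no idf) are my simplifications for illustration.

```python
import math
import re
from collections import Counter

def sentence_vectors(sentences):
    """Bag-of-words term-frequency vectors (idf omitted for brevity)."""
    return [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def central_sentence(sentences, threshold=0.1):
    """Return the sentence most similar to the others (degree centrality)."""
    vecs = sentence_vectors(sentences)
    degree = [
        sum(1 for j, v in enumerate(vecs) if j != i and cosine(u, v) > threshold)
        for i, u in enumerate(vecs)
    ]
    return sentences[max(range(len(sentences)), key=lambda i: degree[i])]
```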

Supervised Techniques
Supervised techniques allow us to predict the probability that a sentence is in an extractive summary or not. Example features include sentence length, sentence weight (such as in topic words), sentence position, and cue words/phrases. In addition, we can use context features (e.g. difference from neighboring sentences) or acoustic features in the case of a supporting audio file [17, 16]. Many types of classifiers are used, such as HMMs [17] and SVMs [16].

A supervised method seems particularly promising for our approach because we could also weight image features to determine segment importance. For instance, we could use gesture detection to find important moments, or detect when new slides are introduced. In the video summarization work discussed earlier, [14] used text to help learn how to create summaries; here we have the stronger assumption that text is also available when classifying sentences for the summary.
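A minimal version of such a sentence classifier, assuming labeled data exists (for example, sentences marked as appearing in a hand-made summary), might look like the sketch below. The specific features and the logistic regression model are illustrative choices, not the setups used in [17] or [16], and image features like those mentioned above could be appended to the same vectors.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical cue phrases; a real system would learn or curate these.
CUE_PHRASES = ("in summary", "importantly", "the key point", "to conclude")

def sentence_features(sentence, position, n_sentences, topic_word_set):
    words = sentence.lower().split()
    return [
        len(words),                                                 # sentence length
        position / max(n_sentences - 1, 1),                         # relative position
        sum(1 for w in words if w in topic_word_set),               # topic-word weight
        int(any(cue in sentence.lower() for cue in CUE_PHRASES)),   # cue phrases
    ]

def train_summary_classifier(X, y):
    """X: feature vectors for transcript sentences; y: 1 if a human included
    the sentence in a reference summary, 0 otherwise (labels assumed)."""
    clf = LogisticRegression()
    clf.fit(X, y)
    return clf  # clf.predict_proba(features)[:, 1] gives inclusion probability
```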

Speech Summarization



Jonathan Corum: Storytelling with Data
Bret Victor: Media for Thinking the Unthinkable

Figure 2. Jonathan Corum and Bret Victor have made caption and thumbnail based skimmable video summaries of their presentation videos.

News article summarization is quite a bit different from speech summarization because the sentences are better formed. This makes news summaries easier to capture with single sentences. For instance, take the first sentence of a news article, “After a botched start, the number of people signing up for health insurance plans under the Affordable Care Act has accelerated greatly, with the most substantial gains occurring in Medicaid programs for the poor,” against the first sentence of a transcript paragraph, “And I put them together” [3, 5]. This brings unique challenges that Li et al. have partially addressed for meeting summarization [16]. Murray et al. detailed a new evaluation method specific to speech summarization [20], and Tucker et al. approached summarization without the transcript by selectively speeding up audio [25].

Interfaces for Viewing Video Summaries

As discussed briefly in the introduction, some authors [9, 26] have created video summaries as seen in Figure 2. The viewer can skim these summaries by reading through the captions and glancing at the associated thumbnails. In the second summary in Figure 2, the user can also click on a section to view the portion summarized [26]. These summaries served as inspiration for our video summary interface.

In the past, others have considered interactive interfaces for exploring video summaries. In particular, Boreczky et al. considered image-based summaries with optional captions that users could interactively explore [8].

INTERFACES
This system features two interfaces. The author uses the summarizing interface to create a video summary. The viewer uses the summary interface to view the author’s summary and index into the video. In this section I’ll explain features of each interface.

Summarizing Interface
The summarizing interface consists of four components, as pictured in Figure 3:

• A video and transcript overview (a)

• A transcript view (b)

• A summary view (c)

• A video view and export (d)

Video and Transcript Overview
The overview provides the user an overview of the frames and paragraphs contained in the video and transcript. The overview includes a highlighted portion which is updated according to the part of the transcript the summary author is viewing. In this way, the author can locate the transcript they are viewing within the larger document. The user can also jump to different portions of the transcript by clicking on the corresponding portion of the overview. The overview could be improved by adding annotations for which parts the author has summarized, so that he or she could have an overview of what they have summarized and what they have left to summarize.

Transcript View
The transcript view allows the viewer to read the transcript of the video. Like the overview, the transcript’s paragraphs are positioned throughout time. Scrolling the transcript is connected to both the overview and the summary view so that the user can view the current position within the entire document and the summaries associated with the transcript text. The transcript view is linked to the video player so that if the user clicks on a portion of the transcript, the video will start playing at this point. In this way the user can view the transcript and the video together in order to summarize the presentation. The transcript view also contains pause markers, marked with p, that are generated by the video alignment. These pause markers correspond to pauses in the speech and may be useful in determining useful thumbnails, as the presenter has paused to bring attention to the screen, or audience reactions, which may mark interesting sections. The transcript could be improved by adding highlighting so that as the video played, it would mark the corresponding points in the transcript.




Figure 3. The Summarizing Interface consists of (a) an overview that viewers can use to locate their current position within the entire video, (b) a transcript view to help the users skim and select video content, (c) a summary view that lets users summarize content, select thumbnails, and preview summaries, and (d) a video viewer to show the visual content and current time within the video. The export button publishes the content to be used in the Summary Viewer.


Summary View
The summary view includes three main components: the related text bar, the thumbnail, and the caption creator. To create a summary, the user highlights the portion of the transcript that they wish to summarize. After they decide which portion they wish to summarize, they press the “\” key to create a summary for that portion. A related text bar will appear to indicate which portion is being summarized. An automatically selected thumbnail will appear to visually signify the summary. A text area will appear for the user to enter a caption. After the user has written the caption, they may press the “enter” key to confirm their summary.

At any time the user may choose to edit their thumbnail. To do this, they can press “c” to capture a new thumbnail from the video at the current video time, then click any thumbnail in the summary area that they wish to replace. The recently captured image will then replace the image currently in the clicked summary thumbnail.

Currently the user cannot highlight across multiple paragraphs; in the future I will include that feature.

Video Viewer and Export
The video viewer responds to interactions with the transcript and summary components. It is also accompanied by a scrubbable timeline, play/pause control, and timestamp. In the future, I will link the video to the transcript such that scrubbing on the timeline will cause the transcript, summary, and overview highlight to scroll along with the video scrubbing.

After the user finishes creating his summary, he can click “Export” to convert the summary information into a format readable by the Summary Interface.

Summary Interface
The summary interface consists of a single video player and multiple summarizing captions with thumbnails. The user can click on any caption to start the video at the summarized portion that caption corresponds to.

Currently the summary interface requires the user to import the output from the summarizing interface. In future work the summary interface will also export a static HTML copy of the video summary so that it can be posted on an author’s site and shared with others. A current workaround is to use the “Copy as HTML” feature in Chrome’s Browser Tools.

METHOD
This section explains technical details associated with each interface component.

Summarizing Interface
Video and Transcript Overview
The overview is created by uniformly sampling frames throughout the duration of the video. The frames are ordered according to their timestamps, and we establish a scale of video-time to pixels in the y-direction. Then, the transcript paragraphs are arranged by finding the time of the first word in the paragraph and then using the scale derived from the video frame timeline to place the paragraph at that point. Alternatively, one could have positioned the transcript contents according to individual words or sentences instead of paragraphs. Because the transcript view reflects the overview, paragraphs seem the most natural for reading and summarizing. In this way, the overview reflects the visual content of the video throughout time and the positioning of the transcript paragraphs throughout the video.
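The layout computation boils down to a shared time-to-pixel scale. The sketch below is a minimal illustration of that mapping, assuming the frame height, video duration, and paragraph start times are already known; the function name and signature are illustrative, not the actual implementation.

```python
def overview_layout(video_duration, n_frames, frame_height, paragraphs):
    """Place uniformly sampled frames and transcript paragraphs on a shared
    vertical timeline. `paragraphs` is a list of (first_word_time, text) pairs."""
    # Uniformly sample frame timestamps across the video.
    frame_times = [i * video_duration / n_frames for i in range(n_frames)]
    # Establish the video-time -> pixel scale in the y direction.
    total_height = n_frames * frame_height
    pixels_per_second = total_height / video_duration
    frame_positions = [t * pixels_per_second for t in frame_times]
    # A paragraph is placed at the y position of its first word's timestamp.
    paragraph_positions = [(t * pixels_per_second, text) for t, text in paragraphs]
    return frame_positions, paragraph_positions
```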

Transcript View



Figure 4. The summary interface consists of (a) a video player and (b) captions with thumbnails that index into the summarized portions of the video.

In this project, videos and corresponding transcripts were obtained from the TED website [5]. For videos that are not already accompanied by transcripts, crowdsourcing websites can be used to obtain verbatim transcripts [1, 4].

To enable time-related operations within the interface, I obtained timestamps for the start and end of each token (word or pause) using the p2fa library, which relies on HTK’s Hidden Markov Model alignment [28, 18].

As partially explained in the overview section, these timestamps were used to place the transcript paragraphs along a vertical timeline. These timestamps were also used to start the video when a word is selected.

Summarizing View
The summarizing view calculates the height of the summary bar by taking the height between the first and last word highlighted in the transcript by the user and adding the line height, to give a visual reminder of the text summarized.

To select an automatic keyframe, the summary view takes the frame halfway between the beginning timestamp of the first word the user highlighted to summarize and the end timestamp of the last word the user highlighted to summarize.

This method was chosen because it selects keyframes reasonably close to the manually chosen keyframe set, especially compared to other simple time-based methods (Figure 5).
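A minimal sketch of this default thumbnail selection is below, assuming OpenCV is available for frame extraction; the function name and the choice of OpenCV are assumptions, and any frame-extraction library would do.

```python
import cv2  # assumed available; any frame-extraction library would work

def default_thumbnail(video_path, first_word_start, last_word_end):
    """Grab the frame halfway between the start of the first highlighted word
    and the end of the last highlighted word (times in seconds)."""
    midpoint = (first_word_start + last_word_end) / 2.0
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, midpoint * 1000.0)  # seek to the midpoint
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```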

One promising avenue for finding useful keyframes is to detect which shots correspond to screen captures vs. video. For instance, in Figure 5a frames 3-6 and 8 correspond to screen captures, and it is easier to view the relevant screen content that separates these sections from others in the video. In a previous experiment to detect keyframes using shot detection I recorded the number of shots and included this in Figure 5.

One way to detect these could be to train a classifier over the color and luminance histograms of video camera shots vs. screen capture shots, as it seems these would differ from looking at the frames. To evaluate the current method and any other methods, we would need to introduce more video footage.
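A sketch of that classifier idea, assuming frames are available as RGB arrays and that some shots have been hand-labeled as screen capture or camera footage; the luminance-histogram feature and the SVM are illustrative choices, not results.

```python
import numpy as np
from sklearn.svm import SVC

def luminance_histogram(frame, bins=32):
    """Normalized luminance histogram of an RGB frame (H x W x 3 uint8 array)."""
    luma = 0.299 * frame[..., 0] + 0.587 * frame[..., 1] + 0.114 * frame[..., 2]
    hist, _ = np.histogram(luma, bins=bins, range=(0, 255))
    return hist / hist.sum()

def train_shot_classifier(frames, labels):
    """frames: list of representative frames per shot;
    labels: 1 for screen capture, 0 for camera footage (hand-labeled)."""
    X = np.stack([luminance_histogram(f) for f in frames])
    clf = SVC(probability=True)
    clf.fit(X, labels)
    return clf
```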

Video and Exporting
After the user finishes creating his summary, he can click “Export” to convert the summary information into a JSON file. This JSON file includes timestamps for the starting point of each summarized section, an image in Base64 for a thumbnail, and the summary text. This JSON is interpreted by the Summary Interface when the user wishes to view the final product.
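A minimal sketch of what that export could look like follows; the field names are assumptions for illustration, and only the three pieces of information (start time, Base64 thumbnail, caption) come from the description above.

```python
import base64
import json

def export_summary(segments, path):
    """Write summary segments to a JSON file. Each segment is assumed to be a
    dict with a start time in seconds, raw JPEG thumbnail bytes, and a caption."""
    payload = [
        {
            "start": seg["start"],
            "thumbnail": base64.b64encode(seg["thumbnail_jpeg"]).decode("ascii"),
            "caption": seg["caption"],
        }
        for seg in segments
    ]
    with open(path, "w") as f:
        json.dump(payload, f)
```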

RESULTS
I implemented the summarizing interface and summary viewer components. These components, with the preloaded video Hans Rosling’s “The Best Statistics You’ve Ever Seen” [5], are now available online at http://www.eecs.berkeley.edu/~amypavel/demos/vis_project2/interface.html and http://www.eecs.berkeley.edu/~amypavel/demos/vis_project2/summary.html. I also manually created a summary for a 20 minute long talk video, Hans Rosling’s “The Best Statistics You’ve Ever Seen” [5], which is partially included in this paper as seen in Table 1 and Figure 5. These elements suggest that the keyframes selected by the simple frame selection algorithm are reasonable for the summaries supplied and closely match the hand picked keyframes for this random sample of paragraphs in the talk.

FUTURE WORK
Future work will concentrate mainly on reducing effort for the summary author by automating or crowdsourcing portions of the video summarization task, and on further evaluating the summarizing interface used by the author and the summary interface viewed by the viewer by comparing them to current methods for summarizing and viewing videos.

Automation
Currently, the author provides all segmentation (by picking the portions to summarize) and summarization by hand.

Segmentation
Segmenting the video into semantically important pieces to summarize is difficult on its own. Techniques from Natural Language Processing such as topic modeling [10] or topic words [21] may help distinguish the segments that address one topic from segments addressing other topics.



Figure 5. This figure shows key frames selected for a set of randomly selected paragraphs from a TED talk. I summarized each paragraph in context of the entire talk and manually selected a keyframe to go along with each paragraph summary (a). The next row (b) shows the frame automatically chosen from the middle of the paragraph time as done in our system. (c) and (d) display frames selected from the end and beginning of the paragraphs respectively. The bottom shows the index of the paragraph within the transcript, the number of shots within the paragraph timespan, and the paragraph duration in seconds.


Previous work has also used audio pitch and pauses to help segment video into reasonable portions for summarization [24, 14, 22]. Sometimes pauses correspond to breaks before moving on to a new topic, and pitch can correspond to a change in the excitement level of the talk.

Another method is through shot recognition. Most video summary techniques rely on shot boundary detection from the video footage [19]. However, breaking up the video only by shots in this case will result in many separate segments, as displayed in an earlier figure. Therefore, we will need some method to determine which shots correspond to a segment. One such possible segmentation could rely on a visual aid. Say the presenter puts a visual aid on the screen, such as a chart or a graph, to support the current topic. There may be many shot changes within the time he talks about such a chart or graph. If we could cluster shots based on content and duration, we may be able to extract the entire segment where the presenter is talking about the chart. Alternatively, if timing data for the presentation is available, we could potentially segment based on slide changes.
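One way such clustering might look, as a rough sketch: greedily merge consecutive shots whose content histograms are similar, so that repeated cuts back to the same slide end up in one segment. The histogram-intersection similarity and the greedy chaining are assumptions made for illustration, not the system's method.

```python
import numpy as np

def merge_shots(shot_histograms, shot_bounds, similarity_threshold=0.9):
    """Greedily merge consecutive shots with similar content histograms.
    `shot_bounds` holds (start, end) times in seconds, one per shot;
    `shot_histograms` holds one normalized histogram per shot."""
    segments = [list(shot_bounds[0])]
    for hist, prev_hist, (start, end) in zip(
        shot_histograms[1:], shot_histograms[:-1], shot_bounds[1:]
    ):
        # Histogram intersection as a simple similarity measure in [0, 1].
        similarity = np.minimum(hist, prev_hist).sum()
        if similarity >= similarity_threshold:
            segments[-1][1] = end          # extend the current segment
        else:
            segments.append([start, end])  # start a new segment
    return [tuple(seg) for seg in segments]
```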

Text summarization
Currently the author must provide text captions for any portion he or she summarizes. Related work discusses many possible avenues for picking sentences for extractive summaries. The most promising avenue is a supervised approach in which we can combine text, audio, and video features to learn the important portions of video presentations.

Figure 6. Example of a transcript paragraph and video combination (a video player next to the aligned transcript paragraph) that a crowdworker would see in a task. One downside to this approach is that the worker summaries may be disjoint by the end of the task.

Work in speech summarization may also give us methods for sanitizing spoken sentences, which include tokens such as “um” or “uh,” to form more cohesive extracted summaries.

Crowdsourcing
Crowdsourcing provides a promising avenue for creating video summaries. In the past, crowdsourcing has been used to shorten text in a word processor [7]. Similarly, one could apply crowdsourcing within the video summarization task to shorten the transcript into a video summary.

For crowdsourcing, you would not want to give only a video or a transcript alone, because one or the other may not have all of the information required. For instance, presentation videos often reference the visual aid (in this case slides) by saying sentences such as “Over time we see China moves much differently than Brazil.” Without the visual aid, it is unclear how China and Brazil are different and what is causing the movement. Therefore, I would present a paragraph-video pair as shown in Figure 6.



#  | Summary for paragraph
2  | Swedish students scored significantly worse than random choice on my pretest.
3  | Professors from the same institution scored on par with random choice. So there was a need to learn about data from other countries.
8  | The world’s distribution of income shows that there is no longer a large gap between rich and poor.
13 | By comparing changes in economic climates over time, we discover how countries such as South Korea and Brazil are progressing differently.
14 | By putting trails on these countries, we see not only is the rate of change different between these countries but also the paths of change.
15 | By splitting up aggregate data we find regions vary greatly in economic climate. Thus, discussions of the improvement of the world must be highly contextualized.
16 | This data is difficult for the public to access because it is hidden in databases.
20 | In the coming years as more data is exposed we will get to see new graphics such as this one.

Table 1. This table shows my manually created text summaries for the keyframes in Figure 5. The first column shows which paragraph was summarized and the second column shows a brief summary for the paragraph.


However, the downside to this approach is that worker summaries may be disjoint and not read fluently. To mitigate this issue, I may have workers do several summaries at once in the context of the three summaries before them. Alternatively, each worker could give a summary and later workers could reword the summaries to flow fluently and combine consecutive repetitive summaries.

This is still assuming that we are segmenting summarization portions based on transcripts. We could also crowdsource the segmenting process by asking users to classify paragraphs as the same or different. One way to do this is by using the grouping process described by Willett et al., in which crowdworkers color code similar items [27]. This method would be best suited for grouping paragraphs, as it does not scale well to many items to classify.

Evaluation
I will evaluate the summarizing interface component by comparing it to any existing techniques and to manual summarization. The study could potentially be constructed between subjects, such that each user uses only one of the manual approach, a previous approach, or our approach, and the results of any method are then fed to the summary interface for evaluation. We could then evaluate the time that users take to create video summaries, qualitative components, and the comprehensibility of the output summary as tested on a second set of evaluators.

Similarly, the summary interface could be compared to other summary interfaces such as the traditional video viewer, a non-indexable summary (thumbnails with captions without the option to click to play the video), or other previous methods for viewing presentation summaries as mentioned in related work.

Interaction
The current interactions required to create a summary and replace a capture in the video are simple but unintuitive and invisible. In the future I would like to investigate better interface controls for these operations which still allow fluid control.

One option is to have buttons which show the corresponding keystroke companions, so that after using a button the user can switch to the keystroke to make the interactions faster. This is especially well suited for this interface because there are a very small number of keystrokes which need to be learned to manipulate the available features.

Additional Summarization Interface Features
I outlined additional features in the interface section. These features include better support for importing and generating transcripts, additional overview components, and further linking of the scrubbable video timeline and transcript component.

CONCLUSION
In this project I have created an interface for summarizing video content using the transcript and an interface for viewing the summary that is the output of the summarization interface. I have also detailed several ways in which portions of the summarization could be automated and what items could be improved in future work.

Moving forward, I will first try crowdsourcing approaches to the video summarization task and run a few short trials of my summary interface on Mechanical Turk with a short video, to see how other people use the interface and whether it could provide a way for crowdworkers to choose keyframes, specify summary times and write summary captions faster than the manual method. I will also attempt supervised techniques for extractive summarization, using the crowd answers as training data depending on the response quality. Second, I will need to gather many additional videos to see if automated and crowdsourcing techniques work well for other TED videos [5] and for videos on other websites such as edX [2]. Additionally, I would like to see how the methods applied here would extend to presentation videos without visual aids and to other types of explanatory videos such as how-to videos.

Because there are many instances of prior work, evaluation of the interface and video summaries is very important. Because it is known to be difficult to evaluate summaries [15], as automated methods tend toward local features rather than overall structure and humans often choose different extractive summaries when asked, I will thoroughly define the evaluation metrics I use in the end and provide a database of the video source URLs I ran on my system and the results produced by users and automatic methods, so that any future work in a similar domain can easily compare against this work.

REFERENCES
1. Casting words. https://castingwords.com/.

2. edX. https://www.edx.org/.

3. New york times. http://nytimes.com.

4. Rev.com. http://www.rev.com/.

5. Ted: Ideas worth spreading. http://www.ted.com/.

6. Barzilay, R., and Lee, L. Catching the drift: Probabilisticcontent models, with applications to generation andsummarization. In Proceedings of HLT-NAACL (2004),113–120.

7. Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D. R., Crowell, D., and Panovich, K. Soylent: A word processor with a crowd inside. In Proceedings of UIST ’10, ACM (2010).

8. Boreczky, J., Girgensohn, A., Golovchinsky, G., andUchihashi, S. An interactive comic book presentation forexploring video. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems,CHI ’00, ACM (New York, NY, USA, 2000), 185–192.

9. Corum, J. Story telling with data.http://style.org/tapestry/, Feb. 2013.

10. Steyvers, M., and Griffiths, T. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning (2007).

11. Haubold, A., and Kender, J. R. Augmentedsegmentation and visualization for presentation videos.MULTIMEDIA ’05 (2005), 51.

12. Haubold, A., York, N., and Kender, J. R. VAST MM :Multimedia Browser for Presentation Video. In CIVR’07 (2007), 41–48.

13. Leskovec, J., Milic-Frayling, N., and Grobelnik, M.Impact of linguistic analysis on the semantic graphcoverage and learning of document extracts. InProceedings of the 20th National Conference onArtificial Intelligence - Volume 3, AAAI’05, AAAI Press(2005), 1069–1074.

14. Li, L. Video summarization via transferrable structured learning. 287–296.

15. Nenkova, A., Maskey, S., and Liu, Y. Automatic summarization. http://aclweb.org/anthology//P/P11/P11-5003.pdf, 2011.

16. Liu, Y., and Xie, S. Impact of automatic sentencesegmentation on meeting summarization. In Proc.ICASSP, Las Vegas (2008).

17. Maskey, S., and Hirschberg, J. Comparing lexical,acoustic/prosodic, structural and discourse features forspeech summarization. In in Proceedings of theInterspeech (2005).

18. Microsoft. HTK. http://htk.eng.cam.ac.uk/, 2009.

19. Money, A. G., and Agius, H. Video summarisation: Aconceptual framework and survey of the state of the art.Journal of Visual Communication and ImageRepresentation 19, 2 (Feb. 2008), 121–143.

20. Murray, G., Renals, S., Carletta, J., and Moore, J.Evaluating automatic summaries of meeting recordings.In in Proceedings of the 43rd Annual Meeting of theAssociation for Computational Linguistics, Workshop onMachine Translation and Summarization Evaluation(MTSE), Ann Arbor, Rodopi (2005), 39–52.

21. Ohtsuki, K., Matsutoka, T., Matsunaga, S., and Furui, S.Topic extraction with multiple topic-words inbroadcast-news speech. In Acoustics, Speech and SignalProcessing, 1998. Proceedings of the 1998 IEEEInternational Conference on, vol. 1 (1998), 329–332vol.1.

22. Smith, M. A., and Kanade, T. Video Skimming andCharacterization through the Combination of Image andLanguage Understanding Techniques. ICCV ’98 (1997).

23. Steinberger, J. Text summarization within the LSAframework. PhD thesis, University of West Bohemia,2007.

24. Taskiran, C. M., Amir, A., Ponceleon, D., and Delp, E. J. Automated video summarization using speech transcripts. In Storage and Retrieval for Media Databases (2002).

25. Tucker, S., and Whittaker, S. Temporal compression ofspeech: An evaluation. Audio, Speech, and LanguageProcessing, IEEE Transactions on 16, 4 (2008),790–796.

26. Victor, B. Media for thinking the unthinkable. http://worrydream.com/MediaForThinkingTheUnthinkable/.

27. Willett, W., Ginosar, S., Steinitz, A., Hartmann, B., andAgrawala, M. Identifying Redundancy and ExposingProvenance in Crowdsourced Data Analysis. IEEETransactions on Visualization and Computer Graphics(Oct. 2013).

28. Yuan, J., and Liberman, M. Speaker identification on the SCOTUS corpus, 2008.

29. Zhu, X., Wu, X., Fan, J., Elmagarmid, A. K., and Aref,W. G. Exploring video content structure for hierarchicalsummarization. Multimedia Systems 10, 2 (Aug. 2004),98–115.
