
Tutorial

Multimedia Analysis + Visual Analytics = Multimedia Analytics

Nancy A. Chinchor ■ ChinchorEclectic

James J. Thomas and Pak Chung Wong ■ Pacific Northwest National Laboratory

Michael G. Christel ■ Carnegie Mellon University

William Ribarsky ■ University of North Carolina at Charlotte

To deal with the extent and variety of digital media, researchers are combining multimedia analysis and visual analytics to form the new field of multimedia analytics. This article gives some historical background, discusses surveys of related research, describes initial multimedia analytics research, and reports on benchmark datasets.

Multimedia analysis has focused on images, video, and, to some extent, audio and has made progress in individual media types. It hasn’t focused on multiple media types or text analysis. Visual analytics has focused on user interaction with data during the analytic process plus the fundamental mathematics, and has continued to treat text as did its precursor, information visualization. (Generally, we use “analytics” to mean the science of human analysis.)

This tutorial addresses combining multimedia analysis and visual analytics to deal with information from different sources, with different goals or objectives, and containing different media types and combinations of types. The resulting combination is multimedia analytics.

Historical Perspective
To begin, here’s a brief history of multimedia analysis and visual analytics, noting these distinct fields’ significant progress through the years.

Multimedia Analysis
Modern multimedia information retrieval (MIR) is rooted in computer vision, digital image processing, and pattern recognition, which started in the late 1970s to early 1980s. Since then, new technologies have continued to emerge in the multimedia R&D community.

In the 1980s, when digitized images weren’t an archival medium for the general public, research frequently covered edge finding, boundary and curve detection, region growing, shape identification, feature extraction, and so on, applied to individual images or frames of images.

In the 1990s, when digital video and images became part of our everyday experience, content-based image retrieval (CBIR) and content-based video clip retrieval were among the most important R&D accomplishments. Robust shot boundary detection and database information query were two of the most active research topics in academic and industrial research labs. The 1990s also saw the World Wide Web’s arrival. The Web brought large amounts of multimedia information directly to our desktop computers and further stimulated the rapid growth of the multimedia and entertainment industries. The first ACM Multimedia International Conference, which included MIR as a major topic, was in 1993.

The MIR community’s primary R&D goal in the 1990s was to develop computer-centric technologies for researchers’ use only. In contrast, the current primary goal is developing human-centric technologies that bridge the gap between general users and the technologies delivering multimedia information to them. We see more attempts to retrieve not just video or image information but also audio information. However, we haven’t seen successful cases of multimedia information fusion in either the academic literature or patent applications. Overall, audio information retrieval hasn’t played a major role in MIR’s evolution. Even less important has been text information retrieval. Only a few studies have covered analysis of documents containing images and text or any other truly mixed-media forms.

The arrival of handheld mobile devices and the wide popularity of multimedia message services further encouraged industry to develop better indexing technologies to organize multimedia information and better browsing and summarization technology to access the information. The past decade also saw the first ACM International Conference on Multimedia Information Retrieval, in 2008. MIR has finally established its own identity and is no longer an R&D track of the multimedia community.

Visual Analytics
A relatively new suite of technologies has emerged from visual-analytics R&D. Visual analytics aims to provide technology for human-centric analysis through dynamic, active visual interfaces for all forms of data, to deal with scale-independent analytics. It employs fundamental mathematics to represent information and transform it into computable forms and employs knowledge sciences to represent multidimensional information.

Visual analytics involves developing a high-dimensional analytic space to enable detection of the expected and discovery of the unexpected during analytical thinking. Visual-analytics researchers envision a highly engaging, intuitive visual interface based on cognitive principles that enables a thought process for analyzing multimedia information across multiple applications. This vision developed out of the natural growth of computer graphics and visualization.

Computer graphics started in the 1970s, focusing on animation, realization, and computer-aided design and engineering, primarily for the automotive and aircraft industries. There was also broad interest in developing and applying computer graphics technologies for scientific domains. A core publication setting an R&D agenda spurred interest in visualization’s potential for scientific computing.1 Consequently, many fundamental research programs in scientific visualization and the IEEE Visualization Conference were launched in the mid-1980s. Although nonscientific applications of visualization were also of interest, a clear focus on scientific domains emerged. This focus stimulated research funding for visualization in chemistry, biology, astronomy, atmospheric sciences, and many other fields, significantly increasing their capabilities.

In the early 1990s, a US government group asked several scientists in research centers to consider visualization of unstructured text documents. At the time, many researchers were visualizing biological sequences for drug discovery; however, developing visualizations for text analysis seemed difficult and had little mathematical foundation.

In tackling text visualization, researchers focused on visualizing 200 to 2,000 documents in a relatively simple format. A prototype for this task appeared in early 1994 and was highlighted on the cover of the proceedings of the first IEEE Symposium on Information Visualization.2 This field of study grew rapidly in the late 1990s as many saw the opportunities in information visualization. Spatial Paradigm for Information Retrieval and Exploration (Spire) technology3 formed the basis of much of this research.

The 2001 terrorist attacks on the US prompted a fresh look at technology to reduce the risk of another attack through effective analysis of all forms and types of information. Also, analysts were swamped by the ever-increasing amount and complexity of information. This situation stimulated the US Department of Homeland Security to establish the National Visualization and Analytics Center (NVAC) at the Pacific Northwest National Laboratory (PNNL) in 2004 to consider a new visualization approach. The PNNL-developed In-Spire technologies4 demonstrated that new approaches were possible. In 2005, a team of approximately 40 individuals from industry, academia, government, and national laboratories developed a visual-analytics R&D agenda.5 This agenda was the foundation for visual analytics and included new thinking on multimodal visual analytics.

The Need for Multimedia Analytics
Full multimedia analytics has been slow to develop, so we’re attempting to bring attention to the critical new suite of technologies required to analyze images, text, video, geospatial data, audio, graphics, tables, and other forms of information. Multimedia analytics is a critical need for a broad range of applications, including, but not limited to, medicine, economics, social media, and security.

Surveys
A glimpse of some of the major peer-reviewed surveys on multimedia-analysis and visual-analytics topics lays a foundation for the field of multimedia analytics.

Multimedia Analysis
MIR includes topics from user interaction, data analytics, machine learning, feature extraction, information visualization, and more. The MIR community generally doesn’t classify text or document information as multimedia data. Also, image media represents most of the MIR community’s work. MIR researchers often study video media, but audio media shows up in only a handful of applications. A particularly difficult problem is dealing with multimedia information from different sources and with different goals or objectives.

Philippe Aigrain and his colleagues addressed image and video information retrieval.6 They covered traditional

■ video analytics topics related to color, texture, shape, and spatial similarities;

■ video-parsing topics such as temporal segmentation, object motion analysis, framing, and scene analysis; and

■ video abstraction topics such as skimming, keyframe extraction, content-based retrieval of clips, indexing, and annotation.

Yong Rui and his colleagues described a complete taxonomy of early, classic image information retrieval techniques.7 They mainly covered

■ feature extraction techniques related to color, texture, shape, color layout, and segmentation;

■ image-indexing techniques such as dimensional reduction and multidimensional indexing; and

■ image retrieval systems developed in the 1990s.

Many of the covered techniques have become the foundational technology for multimedia systems and applications today.

Arnold Smeulders and his colleagues focused on image processing, pattern analysis, and machine learning.8 They started by discussing basic components such as color and texture. They then visited the more advanced topic of features, which can be extracted from an image and form a hierarchy of global features, salient features, signs, shapes, and object features. They also covered machine-learning topics of similarity matching and semantic interpretation as well as database topics such as image indexing, storage, and query.

Cees Snoek and his colleagues described video indexing as a hierarchy that groups different index types.9 This hierarchy characterizes different genres (such as news, sports, movies, or commercials) and subgenres (such as basketball and ice hockey) in terms of their most prominent layout and contents. They split the hierarchy into named events (such as football games and tennis matches) and logical units (such as car chases and violence).

Michael Lew and his colleagues covered image, video, and audio information retrieval from data gathered from different sources and stored in an archive.10 They focused more on human-centric topics that bridge the semantic gap between users and their multimedia information and less on traditional computation-centric topics such as similarity search. They attempted to bridge the semantic gap by “translating the easily computable low level content-based media features to high level concepts or terms which would be intuitive to the users.”10 Although they discussed audio-related problems, most problems they discussed were related to video and image information retrieval. They didn’t discuss multimedia fusion or retrieval of combined multimedia concepts.

Ritendra Datta and his colleagues addressed CBIR.11 Their survey had a strong data-mining flavor and covered all aspects of knowledge discovery of image databases. Besides technical topics such as signature extraction, clustering, categorization, visualization, and similarity matching, they discussed nontechnical issues such as aesthetics, security, the Web, and storytelling.

In other notable surveys, Rainer Lienhart focused on shot boundary detection in video,12 Ming-Hsuan Yang and his colleagues focused on face detection,13 and Johan Tangelder and Remco Veltkamp focused on content-based 3D shape retrieval.14

Two recent papers provide excellent summaries of the state of the art in multimedia analysis for readers new to the field. Snoek and Smeulders tried to answer the question, “Is visual-concept search solved?”15 and Kristen Grauman reviewed new algorithms providing robust but scalable image search.16

Finally, the common belief is that there are no solved problems in the MIR community, which includes the more traditional image and video retrieval community. As Lew and his colleagues put it, “In some cases a general problem is reduced to a smaller niche problem where high accuracy and precision can be quantitatively demonstrated, but the general problem remains largely unsolved.”10

Visual Analytics
The major publications surveying visual analytics have been produced by large groups of researchers. As we mentioned earlier, the first R&D agenda appeared in 2005.5 In December 2009, a special issue of the Journal of Information Visualization looked to visual analytics’ past and future. For an overview, we recommend the guest editors’ introduction in that issue.17 That introduction also discussed five success stories demonstrating the early technologies’ value.

Multimedia Analytics’ Beginnings
Human communication is multimodal. Linguistic studies of sign language in the late 1970s and early 1980s concluded that visual language is just as rule-based and creative as spoken language. However, researchers also noticed that visual cues accompanying spoken language contain significant syntactic and phonological information. Even in our electronically connected world, we find it more satisfying to communicate in person. A telephone, email, or even simply a curtain or darkness between us makes reading the other person difficult.

Analysis of multimedia is no different. Media use multiple modalities to communicate. For example, a video rarely has no sound track, and most reports have some graphics, whether they’re tables or images. Multiple media types collected for many different purposes are regularly presented to us digitally for interpretation. For us to accomplish our analysis quickly, the computer must be able to access these records to suit our immediate needs, utilizing the many rich connections among the media types that humans have placed there for communicative purposes. In a document, figures, images, and even video clips can enhance the text, and the ways in which these types of media combine to form the overall message aren’t well understood. Visual analytics has been a boon to dealing with large data collections. Multimedia analysis has made significant strides in analyzing each type of media. Computational linguistics also has come a long way in extracting meaning from large collections of text and speech. These fields have much to offer each other.

Carnegie Mellon University’s Informedia project (www.informedia.cs.cmu.edu) provides an example of the synergy among these fields. The project’s products result from scientific studies of how humans analyze multimedia; thus, they illustrate multimedia analytics.

Informedia researchers noticed that extracting evidence and support materials from large video repositories can be extremely tedious, owing to the linear time-dependent nature of audio and video recordings, especially those stored on tape. To facilitate better navigation into and across these recordings, Informedia employed speech recognition, image processing, and language technologies to derive synchronized metadata and indexing. Benchmarking forums such as the US National Institute of Standards and Technology Text Retrieval Conference Video Retrieval Evaluation (NIST TRECVID) track have charted Informedia’s progress over the years.

Informedia research has also focused on interfaces that leverage metadata to deliver efficient, effective retrieval from multimedia corpora.18 Rather than predetermine how the user should view a large story collection, the Informedia interface provides multiple views.18 These views draw from information visualization and library science research. (For some examples of Informedia interface research, see the “Informedia and Multimedia Analytics” sidebar.)

For materials with an audio narrative track, such as documentaries or broadcast news, automatic speech recognition (ASR) can provide searchable transcripts for navigation. The first researchers to investigate such transcripts’ accuracy for video retrieval were Informedia’s Alex Hauptmann and Michael Witbrock. They showed that information retrieval can be adequate despite ASR transcription mistakes.

From 1997 to 2000, the NIST TREC Spoken Document Retrieval track further investigated this issue. It concluded that retrieval of excerpts from broadcast news using ASR for transcription allows relatively effective information retrieval, even with word-error rates of 30 percent. Even when accurate transcripts are available, ASR can still add value: ASR engines can provide tight word-time alignment, which supports pinpoint navigation to elements of interest within a longer audio or video recording.
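The mechanics behind such pinpoint navigation are simple once word-time alignment exists. Here’s a minimal sketch of jumping a player to every mention of a query term; the AlignedWord structure and sample transcript are hypothetical stand-ins, not Informedia’s actual data model.

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    word: str     # recognized token (or a transcript word aligned by ASR)
    start: float  # start time in seconds within the recording
    end: float    # end time in seconds

def seek_times(transcript, query):
    """Return the start time of every occurrence of a query term,
    so a player can jump straight to each mention."""
    q = query.lower()
    return [w.start for w in transcript if w.word.lower() == q]

# Hypothetical alignment for a short news excerpt.
transcript = [
    AlignedWord("protest", 12.4, 12.9),
    AlignedWord("in", 12.9, 13.0),
    AlignedWord("downtown", 13.0, 13.6),
    AlignedWord("protest", 47.2, 47.8),
]

for t in seek_times(transcript, "Protest"):
    print(f"jump player to {t:.1f}s")  # prints 12.4s and 47.2s
```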

Much Informedia research has emphasized using the multiple modalities of text, image, and speech to compensate for automated processing’s deficiencies. The group’s interface research has also involved users to overcome automated-processing problems, such as a system misrecognizing a textured set of trees as a crowd or misrecognizing the term “sax” as a person in named-entity tagging. Through such research and through benchmarking activities such as TRECVID, users will undoubtedly have more tools to deal with greater volumes of information. Systems can also employ active learning, in which users mark mistakes so that the system can learn from them and apply that learning when building future classifiers.
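As a concrete illustration of that loop, the sketch below uses uncertainty sampling with scikit-learn: it repeatedly asks for a label on the shot the current classifier is least sure about and retrains. The features, labels, and pool sizes are toy stand-ins, not the project’s actual classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 8))          # shot features (toy stand-ins)
hidden_w = rng.normal(size=8)               # drives the simulated user labels
y_pool = (X_pool @ hidden_w > 0).astype(int)

# Seed with a few labeled shots from each class, as a user might provide.
pos = np.flatnonzero(y_pool == 1)[:5]
neg = np.flatnonzero(y_pool == 0)[:5]
labeled = [int(i) for i in np.concatenate([pos, neg])]
unlabeled = [i for i in range(500) if i not in labeled]

for rnd in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    # Query the shot the classifier is least certain about ...
    probs = clf.predict_proba(X_pool[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    # ... the "user marks the mistake" step, simulated here by y_pool.
    labeled.append(pick)
    unlabeled.remove(pick)
    print(f"round {rnd}: queried shot {pick}, "
          f"pool accuracy {clf.score(X_pool, y_pool):.2f}")
```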

Informedia and Multimedia Analytics
In the following examples, two demonstration corpora illustrate how Carnegie Mellon University’s Informedia project (www.informedia.cs.cmu.edu) has employed speech, image, and language processing to improve navigation in a video corpus. The HistoryMakers African-American oral history digital archive (www.idvl.org/thehistorymakers) is 913 hours of 18,254 stories, with one shot per story. The US National Institute of Standards and Technology Text Retrieval Conference Video Retrieval Evaluation (NIST TRECVID) 2006 broadcast news test set is 165 hours of US, Arabic, and Chinese news sources comprising 5,923 stories, with 146,328 shots.

Further field and evaluation work on these corpora appears elsewhere, along with a complete set of references.1 These examples aim to show the potential of multiple coordinated views of a multimedia space in the hands of an intelligent human operator for video exploitation.

The HistoryMakers Archive
HistoryMakers users are interested in story access, with much information contained in the audio. Consider a user interested in music, specifically gospel, jazz, rap, and rock. Querying this dataset returns 2,372 stories that match one or more of the terms.

Figure A shows a scatterplot of points across a time line, and a visual information browsing environment (VIBE) plot. The VIBE plot, first developed through the University of Pittsburgh Library and Information Science Program, lets users see and manipulate query terms’ contributions to the result space. The initial view shows that “music” is the dominant term, appearing in 2,152 of the results. Unchecking the “music” query term shows that 823 stories remain, discussing one or more of gospel, jazz, rap, and rock. Drawing a bounding box in the VIBE plot, the user can drill down to the 23 stories discussing two or more of these terms.

Figure A. The segment grid, visual information browsing environment (VIBE), and time line views of an Informedia video results set. These views are coordinated, and users can manipulate them.
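Behind these VIBE interactions is plain set arithmetic over an inverted index from query terms to matching stories. A minimal sketch, with a hypothetical toy index standing in for the 18,254-story archive:

```python
# Toy inverted index: term -> set of story IDs containing that term.
index = {
    "music":  {1, 2, 3, 4, 5, 6},
    "gospel": {2, 7},
    "jazz":   {3, 7, 8},
    "rap":    {4, 8},
    "rock":   {5, 9},
}

def matching(terms):
    """Stories matching one or more of the checked query terms."""
    return set().union(*(index[t] for t in terms))

all_terms = ["music", "gospel", "jazz", "rap", "rock"]
print(len(matching(all_terms)))      # all matches (analogous to the 2,372)
print(len(matching(all_terms[1:])))  # unchecking "music" (analogous to the 823)

# Drill down to stories discussing two or more of gospel/jazz/rap/rock,
# like drawing a bounding box near the center of the VIBE plot.
genres = all_terms[1:]
two_plus = {s for s in matching(genres)
            if sum(s in index[t] for t in genres) >= 2}
print(sorted(two_plus))              # analogous to the 23 stories
```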

Further interface elements include color-coded match terms, a relevance score bar in the segment grid view with each result, and information brushing across views. In Figure A, the user’s cursor hovers over Isaac Hayes in the segment grid view, showing a tooltip and coloring yellow the points related to this story in the VIBE and time line views. A yellow point in the VIBE plot indicates that this story discusses rock, rap, and music. A yellow point in the time line shows that this story discusses the 1970s.

The views offer quick, comparative lenses on the data, emphasizing geography, time, and named-entity relationships, among other things. For example, drilling into the 141 rap stories from the 1940s through the 1980s shows that most 1940s references drop out: 34 references show for rap, with only one in the 1940s. Drilling into the 487 jazz stories and opening the named-entities view produces the results in Figure B.

Through a sequence of interactions, the user provides context with which to build more carefully tuned interfaces to serve his or her needs. Perhaps the user was querying only people, in which case he or she could open the named-entity view to emphasize solely people connections rather than people, places, and organizations. The user can select a node such as “Billie Holiday” and ask to see a video skim, or highlight reel, of all the clips tying Billie Holiday to jazz (in this case, three clips). Work with TRECVID video summarization tasks has proven the difficulty of generating video skims in the general case without such user-provided context. With such context, however, the interactive search system can reduce a 913-hour corpus to the 83 seconds that best tie Billie Holiday to jazz, in which the user can listen to a discussion of Billie’s city jazz and country jazz.

Figure B. The named-entity and video highlight views of an Informedia video results set. The user context provides interface power, allowing the interface to show a precise Billie Holiday summary.

The Broadcast News Test Set
Broadcast news retrieval is typically shot-based, with users looking for relevant shots on the basis of aural and visual information, for example, shots of riots, protests, or marches containing crowds of people. Building directly from progress on interactive video retrieval and visual shot classification as charted with TRECVID, Informedia has produced dense shot thumbnail views with interactive filters that control various semantic tags (see Figure C). These tags were produced primarily by Carnegie Mellon University Language Technologies Institute PhD students over the years under Alex Hauptmann’s direction. These students used machine-learning techniques and approaches to generalize the classifiers across corpora, with lessons learned reported in the TRECVID online proceedings (www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html).

In this example, a user starts with a query on “march protest riot assembly” that returns 627 stories containing 16,587 shots. To make the interface more manageable, the query context lets the interface show only those shots in the neighborhood of the query matches. This reduces the set of shots plotted in the thumbnails view from 16,587 to 1,682. The reduction exploits synchronized metadata describing the video corpus: time-aligned narration and shots.

To further reduce the display’s complexity, a user interested in outdoor crowds could select the concepts “outdoor” and “crowds” from a broader set and apply them as filters (see Figure C). The remaining 56 shots can be reviewed for relevance much more quickly than the original set, with the user free to add “road” or other semantic tags as filters.
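Both reductions are straightforward to express in code: a time-window test against the narration’s query-match times, then an intersection with the semantic-tag sets. A minimal sketch with hypothetical shot data (the real corpus has 16,587 shots after the query):

```python
# Each shot is (shot_id, start_sec, end_sec); tags maps tag -> shot IDs.
# All data here is hypothetical toy data.
shots = [(i, 10.0 * i, 10.0 * i + 10.0) for i in range(100)]
match_times = [125.0, 540.0]   # where the narration matched the query
tags = {"outdoor": {12, 13, 54, 80}, "crowds": {13, 54, 77}}

WINDOW = 30.0                  # neighborhood around each match, in seconds

def in_neighborhood(start, end):
    return any(start - WINDOW <= t <= end + WINDOW for t in match_times)

# Reduction 1: query context -- keep only shots near a narration match.
near = [s for s in shots if in_neighborhood(s[1], s[2])]

# Reduction 2: semantic-tag filters -- require "outdoor" AND "crowds".
keep = tags["outdoor"] & tags["crowds"]
final = [s for s in near if s[0] in keep]
print(len(shots), "->", len(near), "->", len(final))
```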

Figure C. The shot thumbnails view of the Informedia video results set for broadcast news. Interactive semantic-tag sliders control the precision/recall trade-off for investigating visual shots.

Reference
1. M.G. Christel, Automated Metadata in Multimedia Information Systems: Creation, Refinement, Use in Surrogates, and Evaluation, Morgan and Claypool, 2009; doi: 10.2200/S00167ED1V01Y200812ICR002.


In addition, Informedia researchers have worked on approaches that can deliver improved recognition but are computationally expensive. For example, MoSIFT (Motion Scale-Invariant Feature Transform) recognizes activities in surveillance videos by exploiting continuous object motion explicitly calculated from optical flow, integrated with distinctive appearance features.19 Such video retrieval approaches’ value is assessed by international benchmarking forums, such as TRECVID, that chart progress on video analytics tasks.
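The core MoSIFT intuition, interest points that are both visually distinctive and actually moving, can be approximated in a few lines of OpenCV. This is a rough sketch of that idea only, not the published MoSIFT implementation, and it assumes an OpenCV build that includes SIFT:

```python
import cv2
import numpy as np

def moving_interest_points(prev_frame, frame, min_flow=1.0):
    """Keep SIFT keypoints that also exhibit substantial optical-flow
    motion: distinctive in appearance *and* actually moving."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return []

    # Dense Farneback flow between consecutive frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)

    kept = []
    for kp, desc in zip(keypoints, descriptors):
        x, y = int(kp.pt[0]), int(kp.pt[1])
        if magnitude[y, x] >= min_flow:  # distinctive and moving
            kept.append((kp, desc, flow[y, x]))
    return kept
```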

Benchmark Datasets and Evaluation
At the IEEE VisWeek 2009 Workshop on Video Analytics, we noticed a palpable excitement about the future combination of multimedia analysis and visual analytics to support digital-data analysis. However, there was a resounding call for datasets. So, here we describe benchmark datasets and evaluation in the two fields.




Multimedia Analysis
TRECVID, under the coordination of Alan Smeaton, Wessel Kraaij, and Paul Over, has charted the progress of various video retrieval tasks, including shot detection, semantic indexing, and fully automatic and interactive retrieval.

Shot detection decomposes a video narrative into component shots. This decomposition enables higher-level processing to classify shots with attributes and to allow construction of a visual table of contents through thumbnail representations of shots (shot thumbnails). A storyboard of shot thumbnails serves a purpose similar to ASR indexing of audio: a means to survey and navigate a linear video presentation. By 2006, most participating systems performed shot detection at greater than 90 percent accuracy, with TRECVID retiring the task in that year to focus on more challenging issues.
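For a feel of the underlying computation, here is a minimal hard-cut detector that compares adjacent frames’ color histograms with OpenCV. It is a baseline sketch only, nowhere near the accuracy of the TRECVID systems described above:

```python
import cv2

def shot_boundaries(path, threshold=0.6):
    """Flag frames whose color histogram correlates poorly with the
    previous frame's -- a simple hard-cut detector."""
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, frame_no = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Correlation near 1 means similar frames; a sharp dip suggests a cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(frame_no)
        prev_hist, frame_no = hist, frame_no + 1
    cap.release()
    return boundaries
```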

Semantic indexing, which automatically assigns semantic tags to video sequences (such as shots), can be fundamental technology for filtering, categorization, browsing, search, and other video exploitation. TRECVID experiments have shown that some visual concepts, such as “face” or “text,” can be automatically tagged to video with excellent accuracy, whereas others, such as “bridge,” “bus,” or “flower,” remain challenging.

In light of automatic classification’s varying accuracy for different semantic tags, Informedia has developed interfaces that let users control whether to require greater precision (seeing fewer candidates with anticipated higher accuracy) or greater recall (seeing more candidates, to avoid missing anything of relevance) for a given task and tag. Allowing the user interactive control over storyboard interfaces, over which tags to apply as filters, and over what degree of precision or recall to use has consistently improved performance of video retrieval tasks. For example, TRECVID experiments confirmed the value of a person in the loop for shot-based retrieval. In these experiments, a human using an interactive search system significantly outperformed fully automatic search.
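In code, such a precision/recall control is just a confidence threshold over a tag’s classifier scores. A tiny sketch with hypothetical scores:

```python
# Hypothetical (shot_id, classifier_confidence) scores for one semantic tag.
scores = [(101, 0.95), (102, 0.80), (103, 0.55), (104, 0.40), (105, 0.15)]

def filter_by_tag(scores, threshold):
    """A slider maps directly to this threshold: raising it favors
    precision (fewer, surer shots); lowering it favors recall."""
    return [shot for shot, conf in scores if conf >= threshold]

print(filter_by_tag(scores, 0.9))  # high precision: [101]
print(filter_by_tag(scores, 0.3))  # high recall: [101, 102, 103, 104]
```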

On the other hand, for TRECVID 2009, for three of the 24 search topics, the best automated system outperformed the best interactive system. As some visual-indexing schemes mature, such as face detection, automated tasks that can best exploit them will also improve dramatically. Examples of such tasks are

■ finding crowds of people and
■ finding people at desks.

These are two of the three topics on which the automated systems did so well.

Visual Analytics
When the field of visual analytics emerged in 2005, a new evaluation effort started. A big issue was the lack of data. So, NVAC started a project to create synthetic datasets very close to real data without issues of classification or personally identifiable information. This project now creates the datasets for the annual IEEE Visual Analytics Science and Technology Challenge (http://hcil.cs.umd.edu/localphp/hcil/vast10/index.php). The project was originally expected to last about 10 years, starting with relatively easy-to-analyze data and then employing increasingly complex data. Today, these datasets are publicly available and being used in education, industry, and government-funded research. Each dataset is a “ground truth” scenario that approximates real situations but is completely open for analysis.

In addition, NVAC established a program to go beyond usability evaluation to utility evaluation. This is a major change. Visual-analytics researchers want to be able to evaluate technologies to show the effectiveness of not only the interface but also the analytic improvement that was the interface’s goal. The overall goal is to develop evaluation processes that enable researchers to scientifically prove the new technologies’ increased analytic value.

Recent progress in visual analytics has been manifested in presentations at large meetings of US government analysts. Visual analytics has been so successful that analysts can now look for more data rather than bemoan the problem of too much data. The established wisdom is now that visualization for data discovery differs significantly from visualization for illustration of evidence.

The VAST challenges so far have resulted in researchers seeing their technology more and more as analysts see it. As these technologies become more successful, they’ll incorporate seamless collection of additional data during analysis. Analysts are valued by their reporting. Their reports’ quality is enhanced by their handling of larger quantities of data and their ability to discover the unexpected. However, researchers have done little work to support the generation of illustrations explaining large quantities of data.20

For a list of publications and societies related to multimedia analytics, see the related sidebar.


Acknowledgments
The US National Science Foundation has supported the Informedia research reported here under grant IIS-0705491. The US National Visualization and Analytics Center at the Pacific Northwest National Laboratory has also supported some of the described research. The Battelle Memorial Institute manages the laboratory for the US Department of Energy under contract DE-AC06-76RL01830.

References
1. ACM Siggraph Computer Graphics (special issue on visualization in scientific computing), vol. 21, no. 8, 1987.

2. Proc. IEEE 1995 Symp. Information Visualization (IV 95), IEEE CS Press, 1995.

3. J.A. Wise et al., “Visualizing the Nonvisual: Spatial Analysis and Interaction with Information from Text Documents,” Proc. 1995 IEEE Symp. Information Visualization (IV 95), IEEE CS Press, 1995, pp. 51–58.

4. E. Hetzler and A. Turner, “Analysis Experiences Using Information Visualization,” IEEE Computer Graphics and Applications, vol. 24, no. 5, 2004, pp. 22–26.

5. J.J. Thomas and K.A. Cook, eds., Illuminating the Path: The Research and Development Agenda for Visual Analytics, IEEE CS Press, 2005.

6. P. Aigrain, H. Zhang, and D. Petkovic, “Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review,” Multimedia Tools and Applications, vol. 3, no. 3, 1996, pp. 179–202.

7. Y. Rui, T.S. Huang, and S.-F. Chang, “Image Retrieval: Current Techniques, Promising Directions, and Open Issues,” J. Visual Communication and Image Representation, vol. 10, no. 1, 1999, pp. 39–62.

8. A. Smeulders et al., “Content-Based Image Retrieval at the End of the Early Years,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, 2000, pp. 1349–1380.

9. C.G.M. Snoek et al., “MediaMill: Exploring News Video Archives Based on Learned Semantics,” Proc. 13th ACM Int’l Conf. Multimedia (MM 05), ACM Press, 2005, pp. 225–226.

10. M.S. Lew et al., “Content-Based Multimedia Information Retrieval: State of the Art and Challenges,” ACM Trans. Multimedia Computing, Communications, and Applications, vol. 2, no. 1, 2006, pp. 1–19.

11. R. Datta et al., “Image Retrieval: Ideas, Influences, and Trends of the New Age,” ACM Computing Surveys, vol. 40, no. 2, 2008, article 5.

12. R. Lienhart, “Reliable Transition Detection in Videos: A Survey and Practitioner’s Guide,” Int’l J. Image and Graphics, vol. 1, no. 3, 2001, pp. 469–486.

13. M.-H. Yang, D.J. Kriegman, and N. Ahuja, “Detecting Faces in Images: A Survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 1, 2002, pp. 34–58.

14. J. Tangelder and R.C. Veltkamp, “A Survey of Content Based 3D Shape Retrieval Methods,” Proc. Int’l Conf. Shape Modeling and Applications, IEEE Press, 2004, pp. 157–166.

15. C.G.M. Snoek and A.W.M. Smeulders, “Visual-Concept Search Solved?” Computer, vol. 43, no. 6, 2010, pp. 76–78.

16. K. Grauman, “Efficiently Searching for Similar Images,” Comm. ACM, vol. 53, no. 6, 2010, pp. 84–94; http://portal.acm.org/citation.cfm?id=1743546.1743570.

17. J. Kielman, J. Thomas, and R. May, “Foundations and Frontiers in Visual Analytics,” J. Information Visualization, vol. 8, no. 4, 2009, pp. 239–246.

18. M.G. Christel, Automated Metadata in Multimedia Information Systems: Creation, Refinement, Use in Surrogates, and Evaluation, Morgan and Claypool, 2009; doi: 10.2200/S00167ED1V01Y200812ICR002.

19. M. Chen et al., “Exploiting Multi-level Parallelism for Low-Latency Activity Recognition in Streaming Video,” Proc. 1st Ann. ACM SIGMM Conf. Multimedia Systems (MMSys 10), ACM Press, 2010, pp. 1–12; http://doi.acm.org/10.1145/1730836.1730838.

20. N. Chinchor and W. Pike, “The Science of Analytic Reporting,” J. Information Visualization, vol. 8, no. 4, 2009, pp. 286–293.

Publications and Societies Related to Multimedia Analytics

■ 2004–2009 ACM International Multimedia Conference Proceedings, http://portal.acm.org
■ ACM Transactions on Multimedia Computing, Communications, and Applications, http://tomccap.acm.org
■ IEEE Computer Graphics and Applications, www.computer.org/cga
■ IEEE Multimedia, www.computer.org/portal/web/multimedia/home
■ IEEE Transactions on Multimedia, www.ieee.org/organizations/society/tmm
■ IEEE Transactions on Visualization and Computer Graphics, www.computer.org/portal/web/tvcg
■ 2007–2009 IEEE VAST Symposium and Conference Proceedings, http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=4388976
■ Information Visualization, www.palgrave-journals.com/ivs/index.html
■ ACM Special Interest Group on Multimedia, www.sigmm.org
■ IEEE Communications Society, www.comsoc.org
■ IEEE Technical Committee on Multimedia Communications, http://committees.comsoc.org/mmc
■ IEEE Technical Committee on Visualization and Graphics, www.cc.gatech.edu/gvu/tccg


Nancy A. Chinchor is founder of ChinchorEclectic. Her research interests are analytics of visual media and multi-media. Chinchor has a PhD in linguistics from Brown Uni-versity. Contact her at [email protected].

James J. Thomas is an American Association for the Advancement of Science Fellow and Laboratory Fellow at the Pacific Northwest National Laboratory. He’s founder and past director of the Department of Homeland Security National Visualization and Analytics Center. His research interests include establishing investment directions for visual analytics and information and computing technology, leading major technology initiatives, mentoring staff, and serving as a principal investigator on several major science programs. Thomas has a master’s in computer science from Washington State University. Contact him at [email protected].

Pak Chung Wong is chief scientist at the Pacific Northwest National Laboratory. His research interests are visualization, bioinformatics, data signature, computational science, steganography, and wavelets. Wong has a PhD in computer science from the University of New Hampshire. Contact him at [email protected].

Michael G. Christel is a research professor at Carnegie Mellon University’s Entertainment Technology Center. His research interests are edutainment, digital libraries, human-computer interaction, and multimedia analytics. Christel has a PhD in computer science from Georgia Tech. Contact him at [email protected].

William Ribarsky is the Bank of America Endowed Chair in Information Technology and the chair of the Computer Science Department at the University of North Carolina at Charlotte. His research interests are visual analytics, 3D multimodal interaction, bioinformatics visualization, virtual environments, visual reasoning, and interactive visualization of large-scale information spaces. Ribarsky has a PhD in physics from the University of Cincinnati. He’s on the IEEE Computer Graphics and Applications editorial board. Contact him at [email protected].
