www.elsevier.com/locate/jvci
J. Vis. Commun. Image R. 15 (2004) 370–392
Video personalization and summarization system for usage environment
Belle L. Tseng*, Ching-Yung Lin, John R. Smith
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
Received 5 June 2003; accepted 5 April 2004
Abstract
A video personalization and summarization system is designed and implemented incorpo-
rating usage environment to dynamically generate a personalized video summary. The person-
alization system adopts the three-tier server–middleware–client architecture in order to select,
adapt, and deliver rich media content to the user. The server stores the content sources along
with their corresponding MPEG-7 metadata descriptions. Our semantic metadata is provided
through the use of the VideoAnnEx MPEG-7 Video Annotation Tool. When the user initiates
a request for content, the client communicates the MPEG-21 usage environment description
along with the user query to the middleware. The middleware is powered by the personaliza-
tion engine and the content adaptation engine. Our personalization engine includes the Video-
Sue Summarization on Usage Environment engine that selects the optimal set of desired
contents according to user preferences. Afterwards, the adaptation engine performs the
required transformations and compositions of the selected contents for the specific usage envi-
ronment using our VideoEd Editing and Composition Tool. Finally, two personalization and
summarization systems are demonstrated for the IBM Websphere Portal Server and for per-
vasive PDA devices.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Video personalization; Summarization; Adaptation; Usage environment; MPEG-7; MPEG-21;
Annotation; Rich media description; Digital item adaptation; Rights expression
1047-3203/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.jvcir.2004.04.011
* Corresponding author.
E-mail addresses: [email protected] (B.L. Tseng), [email protected] (C.-Y. Lin), [email protected] (J.R. Smith).
1. Introduction
With the growing amount of multimedia content, people become more willing
to view personalized multimedia based on their usage environments. When peo-
ple use their pervasive devices, they generally restrict their viewing time on the limited displays and minimize the amount of interaction and navigation to get
to the content. When they browse video on the Internet, they may want to
get only the videos that match their preferences. Because of the existence of het-
erogeneous user clients and data sources, it is a real challenge to implement a
universally compliant system that fits various usage environments. Most existing
video summarization tools address their applications in proprietary environments. Some systems display summarized videos using a number of key frames
for each detected scene shot to generate a storyboard (Yeung et al., 1995). Merialdo et al. (1999) maximize story content value to generate personalized TV
news programs based on user preference and time constraint. Sundaram and
Chang (2000) incorporate film structure rules to detect coherent summarization
segments. Gong and Liu (2001) use singular value decomposition (SVD) of the attribute matrix to reduce the redundancy of video segments and thus generate video
summaries. In addition to the systems based on audio-visual information, some
researchers have proposed methods to detect semantically important events based on other resources. For instance, Aizawa et al. (2001) use brain waves to detect exciting moments of the subjects. Ma and Zhang (2002) generate video summaries by using user attention models based on multiple sensory perceptions. In industry
sectors, companies such as NTT DoCoMo and Virage have implemented com-
mercialized video summarization systems. Japanese Wireless carrier NTT DoCo-
Mo streams assorted video summaries to its I-mode cellular phones over the 3G
network. Virage provides video content management and creates personalized vi-
deo summaries including NHL hockey highlights, NASCAR racing and ABC
News.

This paper addresses the issues of designing a video personalization and summariza-
tion system in heterogeneous usage environments and provides a tutorial introduc-
tion on their associated issues in MPEG-7 and MPEG-21. The server maintains
the content sources, the MPEG-7 metadata descriptions, the MPEG-21 rights
expressions, and content adaptability declarations. The client communicates the
MPEG-7 user preference, MPEG-21 usage environments, and user query so as to re-
trieve and display the personalized content. The middleware is powered by the per-
sonalization engine and the adaptation engine. The personalization and summarization system adopts the three-tier server–middleware–client architecture
in order to select, adapt, and deliver rich media content to the user.
This paper is organized as follows. In Section 2, we describe the personalization
and summarization framework that is currently evolving in MPEG-21 and what
has been included in MPEG-7. In Section 3, we describe our video personalization
and summarization system, which includes the user client, the personalization and
adaptation media middleware, and the database server. Details of these three tiers
are described in Sections 4–6.
2. Digital item adaptation using MPEG-7 and MPEG-21
To personalize the delivery and consumption of multimedia content for individual users, a framework coined Digital Item Adaptation by the ISO/IEC MPEG-21 group has been identified as essential for personalization. Adaptation aims at providing the appropriate content types to the different terminals via the preferred networks. MPEG-21 Digital Item Adaptation provides the tools to support re-
source adaptation of content, descriptor adaptation through metadata, and quality
of service management (Digital Item Adaptation, 2002a,b).
Fig. 1 illustrates a fundamental structure for Digital Item Adaptation. The content
sources may include text, audio, images, and/or video. Associated with each source is
the corresponding digital item declaration that identifies and describes the content.
MPEG-21 Digital Item Declaration (DID) provides standard description schemes that identify the content and include components for resource adaptation, rights expression, and adaptation rights. Under MPEG-21 DID, there is a standard scheme
for content adaptability called Digital Item Resource Adaptation. While MPEG-7 is
used to describe content using features, semantics, structures, and models (FDIS,
Information Technology, 2001; Manjunath et al., 2002), it also includes adaptation
hints, covered under the MPEG-7 Media Resource Requirements, which overlap with MPEG-21. In addition, MPEG-21 has a similar
Fig. 1. Block diagram of digital item adaptation using MPEG-7 and MPEG-21. The personalization and
adaptation engines match the content sources and corresponding digital item declarations with the user
query and usage environment through a controlled term list in order to: (1) optimally select the set of
personal content and (2) effectively adapt that dynamic content for the user.
requirement for content adaptability called MPEG-21 Media Resource Adaptability,
and thus overlaps with MPEG-7 as illustrated in Fig. 1.
Rights expression is another subcomponent of the MPEG-21 Digital Item Declaration and describes the rights associated with multimedia content. These include the rights and permissions for an individual or designated group to view, modify, copy, or distribute the content. Among the expressions relevant to personalization are adaptation rights. For example, the owner of an image may allow cropping of the image to fit the desired display size, but not scaling.
From the perspective of the user, the usage environment defines the user profiles,
terminal properties, network characteristics, and other user environments. MPEG-21 Usage Environment is in the process of defining specific requirements to describe such conditions. The usage environment also includes user preferences, which can be partially or fully described by the MPEG-7 User Preferences, adopted as a subcomponent of the MPEG-21 Usage Environment (Digital Item Adaptation, 2002a). Furthermore, the user can request specific content in a special application not defined by the standard. This request is represented as the user query in Fig. 1.
To match the user's usage environment with the content's digital item declaration, the adaptation engine is introduced to personally select the desired content and optimally adapt that content in accordance with the descriptions. Specifically, users specify their requests for content through the user query and usage environment. On the other hand, the digital item declaration encompasses both the rights expression and the content adaptability of the available content. To assist the matching
process, a common language can be provided through the MPEG-7 Controlled Term
List. This list allows specific domain applications to define their taxonomy, their
relationships, and their importance. In summary, the adaptation engine takes the
user query, the usage environment, the digital item declaration, and the controlled
term list to adapt the content sources to generate the personalized content.
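As a rough illustration of this matching step, the following sketch uses simplified dictionary stand-ins for the user query, the usage environment, the digital item declarations, and the controlled term list (all field names are hypothetical, not the standardized syntax). It expands terms through the controlled vocabulary, filters out items the terminal cannot render, and ranks the rest:

```python
def personalize(items, query_terms, usage_env, controlled_terms):
    """Rank digital items against a user query and usage environment.

    controlled_terms maps a term to related terms, playing the role of
    the MPEG-7 Controlled Term List in the matching process.
    """
    def expand(terms):
        out = set(terms)
        for t in terms:
            out.update(controlled_terms.get(t, []))
        return out

    wanted = expand(query_terms)
    results = []
    for item in items:
        if item["format"] not in usage_env["supported_formats"]:
            continue  # the terminal cannot decode this variation
        score = len(expand(item["terms"]) & wanted)
        if score > 0:
            results.append((score, item["id"]))
    # Highest-scoring items first.
    return [item_id for _, item_id in sorted(results, reverse=True)]
```

For example, a query for "sports" on an MPEG-1-only terminal would rank an item labeled "sports" above one labeled only with the related term "hockey", and drop MPEG-4 items entirely.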
In the rest of this section, we describe more details of the MPEG-7 and MPEG-21
components in digital item adaptation for usage environment, media description, rights expression, and content adaptability.
2.1. Media descriptions
Media descriptions identify and describe multimedia content at different abstraction levels, from low-level features to various interpretations of semantic concepts. MPEG-7 provides description schemes (DS) to describe content in XML to facilitate the search, indexing, and filtering of audio-visual data (FDIS, Information Technology, 2001; Manjunath et al., 2002). The DSs are designed to describe both the
audio-visual data management and the specific concepts and features. The data man-
agement descriptions include metadata about creation, production, usage, and man-
agement. The concepts and features metadata may include what the scene is about in
a video clip, what objects are present in the scene, who is talking in the conversation,
and what is the color distribution of an image.
MPEG-7 standardizes the description structure; however, technical challenges re-
main. The generation of these descriptions is not part of the MPEG-7 standard and
the technologies for generating them are variable, competitive, and in some cases non-existent. As such, we have developed an annotation tool that lets a human annotator capture the underlying semantic meaning of video clips. These annotators are assisted in their annotation task through the use of a finite vocabulary set. This set can be readily represented by the MPEG-7 Controlled Term List, which can be customized for different domains and applications. For example, in a news
program, the list can include the anchors' names, topic categories, and camera cap-
tures. Our annotation tool is described in Section 4.
2.2. Usage environment
The usage environment holds the profiles about the user, device, network, deliv-
ery, and other environments. This information is used to determine the optimal content selection and the most appropriate form for the user. Besides the MPEG standards, several others have been proposed. HTTP/1.1 uses the composite capabilities/preference profile (CC/PP) to communicate client profiles. The forthcoming wireless access protocol (WAP) proposes the user agent profile (UAProf), which includes device profiles covering the hardware platform, software platform, network characteristics, and browser (Butler, 2001).
The MPEG-7 UserPreferences DS allows users to specify their preferences (likes and dislikes) for certain types of content and for ways of browsing (FDIS, Information Technology, 2001). To describe the types of desired content, the FilteringAndSearchPreferences DS is used, which consists of the creation of the content (CreationPreferences DS), the classification of the content (ClassificationPreferences DS), and the source of the content (SourcePreferences DS). To describe the ways of browsing the selected content requested by the user, the BrowsingPreferences DS is used along with the SummaryPreferences DS. Furthermore, each user preference component is associated with a preference value indicating its relative importance with respect to that of other components. For instance, a user preference description can encapsulate the preference ranking among several genre categories (ClassificationPreferences DS) that are produced in the United States in the last decade (CreationPreferences DS) in wide screen with Dolby AC-3 audio format (SourcePreferences DS). The user may also be interested in nonlinear navigation and access of the retrieved summarization content, specified by the preferred duration and preference ranking (SummaryPreferences DS).
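A description along these lines might be assembled as in the following sketch. The DS names come from the text above, but the nesting and attributes are simplified relative to the actual MPEG-7 schema, so treat this as illustrative rather than schema-valid output:

```python
import xml.etree.ElementTree as ET

def user_preferences(genres, summary_duration_s):
    """Build a simplified MPEG-7-style UserPreferences description:
    a ranked genre list under FilteringAndSearchPreferences and a
    preferred summary duration under BrowsingPreferences."""
    prefs = ET.Element("UserPreferences")
    fsp = ET.SubElement(prefs, "FilteringAndSearchPreferences")
    cls = ET.SubElement(fsp, "ClassificationPreferences")
    for rank, genre in enumerate(genres, start=1):
        # Earlier genres get a higher preference value.
        g = ET.SubElement(cls, "Genre",
                          preferenceValue=str(len(genres) - rank + 1))
        g.text = genre
    browse = ET.SubElement(prefs, "BrowsingPreferences")
    summ = ET.SubElement(browse, "SummaryPreferences")
    # ISO 8601 duration, e.g. PT120S for two minutes.
    ET.SubElement(summ, "SummaryDuration").text = f"PT{summary_duration_s}S"
    return ET.tostring(prefs, encoding="unicode")
```

Calling `user_preferences(["news", "sports"], 120)` yields a description ranking news above sports with a two-minute preferred summary duration.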
The MPEG-7 User Preferences Descriptions specifically declare the user's preferences for the filtering, search, and browsing of the requested multimedia content. However, other descriptions may be required to account for the terminal, network, delivery, and other environment parameters. The MPEG-21 Usage Environment Descriptions cover exactly these extended requirements, even though the specific requirements are still being defined and refined (Digital Item Adaptation, 2002b). The
descriptions on terminal capabilities include the device types, display characteristics,
output properties, hardware properties, software properties, and system configura-
tions. This allows the personalized content to be appropriately adapted to fit the ter-
minal. For instance, videos can be delivered to wireless devices in smaller image sizes
in a format that the device decoders can handle. The descriptions on the physical net-
work capability allow content to be dynamically adapted to the limitation of the net-
work. The descriptions on the delivery layer capabilities include the types of transport
protocols and connections. These descriptions allow users to access location-specific
applications and services. The MPEG-21 Usage Environment Descriptions also include User Characteristics, which describe service capabilities, interactions and relations among users, conditional usage environments, and dynamic updating.
2.3. Content adaptability
Content adaptability refers to the multiple variations into which a media item can be transformed, through changes in format, scale, rate, and/or quality. Format transcoding may be required to accommodate the user's terminal devices. For example, an MPEG movie cannot be rendered on the Palm III due to the lack of an MPEG decoder, and thus must be transcoded into bitmap images to be rendered as a slide show. Scale conversion can represent image resizing, video frame rate extrapolation, or audio channel enhancement. Rate control corresponds to the data rate for transferring the media content, and may allow variable or constant rates. Quality of service can be guaranteed to the user based on criteria including SNR or distortion quality measures. These adaptation operations transform the original content to efficiently fit the usage environment.

The MPEG-7 media resource requirement and the MPEG-21 media resource adapt-
ability both provide descriptions for allowing certain types of adaptations. These
adaptation descriptions contain information for a single adaptation or a set of adap-
tations. The descriptions may possibly include required conditions for the adaptation,
permissions for the adaptation, and configurations for the adaptation operation.
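A minimal sketch of how such a declaration might gate adaptation decisions follows; the dictionary fields are hypothetical stand-ins, not the standardized description syntax:

```python
def plan_adaptation(declaration, terminal):
    """Choose adaptation operations that the content's adaptability
    declaration permits and that the terminal requires.

    declaration: the content's format, width, and permitted adaptations.
    terminal:    the target device's format and maximum display width.
    """
    ops = []
    if terminal["format"] != declaration["format"]:
        if terminal["format"] not in declaration.get("transcodable_to", []):
            raise ValueError("declaration does not permit this transcoding")
        ops.append(("transcode", terminal["format"]))
    if terminal["max_width"] < declaration["width"]:
        if not declaration.get("scalable", False):
            raise ValueError("declaration does not permit scaling")
        ops.append(("scale", terminal["max_width"]))
    return ops
```

Mirroring the Palm III example above, an MPEG-1 source declared transcodable to bitmaps and scalable would yield a transcode operation followed by a downscale for a small bitmap-only display.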
3. System overview
Our Video Personalization and Summarization System comprises three major components: the user client, the database server, and the media middleware. Fig. 2 illustrates the block diagram of these components. The user client component allows the user to specify his or her preference query along with the usage environment, and receives the personalized content on the display client. The database server component stores all the content sources as well as their corresponding MPEG-7 media descriptions, MPEG-21 rights expressions, and content adaptability declarations. The media middleware represents the intermediate component of the figure; it processes the user query and the usage environment with the media descriptions and rights expressions to generate personalized content, which is appropriately adapted and transmitted to the user's display client.
In the client end of our system, a user can request personalized content by specifying a user query and communicating his/her usage environment to the media middleware. The user query takes the form of preference topics, certain keywords, and the user's time constraint for watching the content. The usage environment
Fig. 2. Block diagram of video personalization and summarization system. The three major components
are the user client, the database server, and the media middleware.
includes descriptions about the client terminal capabilities, physical network proper-
ties, and delivery layer characteristics. The display client can range from a pervasive
mobile device to a networked high-end workstation. The customized content is opti-
mally selected, summarized and adapted in the middleware and personally delivered
to the user.
The database server provides the media middleware with the content sources and their corresponding descriptions. Each content item is associated with a set of media descriptions, rights expressions, and adaptability declarations. In the database server, the content sources can include text, audio, image, video, and graphics, and are stored in various formats (e.g., MPEG-1/2/4, AVI, and QuickTime). The corresponding media descriptions include feature descriptions as well as semantic concepts. These semantic descriptions can be semi-automatically generated by our VideoAnnEx MPEG-7 annotation tool (Lin et al., 2003) or automatically generated by the VideoAL MPEG-7 automatic labeling tool (Lin et al., 2003). For each content item, there may be an associated set of rights expressions on the various usages of the media. These expressions define the right for others to view, copy, print, modify, or distribute the original content. Similarly, there is an associated set of content adaptability declarations on the possible variation transformations that can be applied to the media.
The media middleware interfaces the user client and the database server. The mid-
dleware consists of the personalization engine and the adaptation engine. The media
descriptions and rights expressions are used by the personalization engine to opti-
mally select the required sources for the personalized content. Afterwards, the se-lected content sources, stored in SMIL, ASX, or MPEG-4 XMT format, and
corresponding content adaptability declarations are sent to the adaptation engine
to generate the personalized content. In the personalization engine, the user query
and usage environment are matched with the media descriptions and rights expres-
sions to generate the personalized content. In the adaptation engine, the results of
the personalization engine are used to retrieve the corresponding content sources
and content adaptability declarations. Our adaptation engine, powered by the Vid-
eoSue summarization modules (Tseng et al., 2002; Tseng and Lin, 2002; Tseng
and Smith, 2003), determines the optimal variation (i.e., format, size, rate, and qual-
ity) of the content for the user in accordance with the adaptability declarations and
the inherent usage environment. Subsequently, the appropriate adaptation is per-
formed on the media content and delivered to the user's display client. Our adapta-
tion engine consists of multiple transcoders including our VideoEd editing and
composition tool (Lin et al., 2002) and Universal Tuner video transcoder (Han
et al., 2001).
4. Database server
The content sources are stored, described, analyzed and annotated with MPEG-7
and MPEG-21 descriptions. Fig. 2 illustrates the database server with four major
components. First, the database can store video sources in any format. Our media
middleware accepts various kinds of video file types, bit-rates, and resolutions. Also, each video is only required to be stored once, in any format, because the middleware can dynamically access and transcode the desired video segments in sequence and generate one composite video in real time. Even though each output video is a composition of multiple video segments, our database does not need to pre-segment the videos. This brings significant benefits to content and storage management.
Second, the database stores media descriptions in MPEG-7 format. Two tools can
be used for associating labels with content. We implemented the VideoAnnEx MPEG-7 annotation tool to assist users in annotating semantic descriptions semi-automatically. The VideoAnnEx is one of the first MPEG-7 annotation tools to be made publicly available. It offers a number of interesting capabilities, including automatic shot detection, key-frame selection, automatic label propagation to visually similar shots, and the importing, editing, and customizing of ontologies. The second tool is an automatic labeling tool based on pre-trained classifiers. The VideoAL accepts MPEG-1 sequence inputs and generates MPEG-7 XML metadata files based on previously established anchor models.
Third, the database maintains rights management on the usage of content sources as MPEG-21 Rights Expressions. Content owners can specify rights expressions for
an individual or group to view, copy, print, modify, and/or distribute the video. Fi-
nally, the database also stores content adaptability descriptions as MPEG-7 Media
Resource Requirement and MPEG-21 Media Resource Adaptability. Content
adaptability specifies requirements on the adaptability parameters and constraints
for changing the original content to a personalized derivative for the user. In this pa-
per, we shall focus on the details of the content adaptability functions of the system,
without further addressing the rights issues.
4.1. Video segmentation
In general, a short video clip can be annotated by simply describing its content in
its entirety. However when the video is longer, annotation of its content can benefit
from segmenting the video into smaller units. A video shot is defined as a continuous camera-captured segment of a scene, and is usually well defined for most video
content. Given the shot boundaries, the annotations are assigned for each video
shot.
Shot boundary detection is performed to divide the video into multiple non-over-
lapping shots. We use either the IBM CueVideo Toolkit or the built-in functionality
in VideoAnnEx to perform the shot boundary detection. CueVideo, which is based on multiple-timescale differencing of the color histogram, segments our video content into shorter shots, where scene cuts, dissolves, and fades are effectively detected (Amir et al., 2000). Our VideoAnnEx tool can also perform the video segmentation, based on color histogram distributions.
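A single-timescale version of histogram-based cut detection can be sketched as follows. CueVideo's multiple-timescale differencing and its dissolve/fade handling are more involved; this only illustrates the basic idea of flagging a boundary where consecutive frame histograms diverge:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame (H x W x 3, uint8) into a normalized
    joint color histogram with bins**3 cells."""
    quantized = (frame.astype(np.int64) // (256 // bins)).reshape(-1, 3)
    # Map each pixel's (r, g, b) bin triple to one index in [0, bins**3).
    idx = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    """Mark a boundary wherever the L1 distance between consecutive
    frame histograms exceeds the threshold (a hard cut)."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]
```

On a synthetic sequence of five black frames followed by five white frames, the detector reports a single boundary at frame 5.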
4.2. Video content semantic lexicon
In general, a video shot can fundamentally be described by three attributes. The
first is the background surrounding of where the shot was captured by the camera,
which is referred to as the static scene. The second attribute is the collection of significant subjects involved in the shot sequence, which is referred to as the key object.
Lastly, the third attribute is the corresponding action taken by some of the key ob-
jects, which is referred to as the event. These three types of lexicon define the vocab-
ulary for our video content.
According to the characteristics of the video corpus, a pre-defined lexicon set can be
imported into our IBM VideoAnnEx. The shots are labeled with respect to the
selected lexicon. Users can also associate each label with regions in the key-frame of the shot. Note that the lexicon, whose format is compatible with MPEG-7, is dependent on the summarization application, and can be modified, imported, and
saved using the VideoAnnEx.
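In memory, one shot's labels under this three-attribute lexicon might look like the following sketch; VideoAnnEx itself stores such labels as MPEG-7 XML, and the field names here are illustrative:

```python
def annotate_shot(shot_id, static_scene, key_objects, events):
    """Record one shot's labels under the three lexicon attributes:
    static scene (background), key objects (significant subjects),
    and events (actions taken by key objects)."""
    return {
        "shot": shot_id,
        "static_scene": static_scene,      # e.g. "ice rink"
        "key_objects": list(key_objects),  # e.g. ["player", "goalie"]
        "events": list(events),            # e.g. ["goal scored"]
    }
```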
4.3. VideoAnnEx—MPEG-7 video annotation tool
We developed an MPEG-7 video annotation tool, VideoAnnEx, to assist authors in annotating video sequences with MPEG-7 metadata. Each shot in the video can be annotated according to selected lexicon sets. VideoAnnEx takes an MPEG video sequence as the required input source. It requires a corresponding shot segmentation
file, which can be loaded into the tool from other sources or generated when the in-
put video is first opened.
The VideoAnnEx annotation tool is divided into four graphical sections as illus-
trated in Fig. 3. The Video Playback window displays the video sequence. As the
video is played back in the display window, the current shot information is given
as well. The Shot Annotation module displays the defined semantic lexicons and
the key frame window. The Views Panel displays two different previews of representative images of the video. The Frames in the Shot view shows all the I-frames as representative images of the current video shot, while the Shots in the Video view (as in the bottom of Fig. 3) shows the key frames of shots over the entire video. As the annotator labels each shot, the descriptions are displayed below the corresponding key frames. A more detailed description of the annotation tool as well as its active learning components is given in Lin et al. (2003).

Fig. 3. The IBM VideoAnnEx MPEG-7 video annotation tool is divided into four regions: (1) video playback, (2) shot annotation, (3) views panel, and (4) region annotation (not shown).
Labels can be associated with either the entire video shot or the regions on the key-frames of video shots. A video shot is defined as a continuous camera-captured segment of a scene. Shot boundary detection is performed to divide the video into multiple non-overlapping shots. The VideoAnnEx performs shot boundary segmentation in the compressed domain, based on color histogram distributions.
In general, a video shot can fundamentally be described by three attributes. The
first is the surrounding background, which is referred to as the static scene. The
second attribute is the collection of significant subjects, which is referred to as the
key object. The third attribute is the corresponding action taken by some of the key objects, which is referred to as the event. According to the characteristics of the
video corpus, a pre-defined lexicon set can be imported into the VideoAnnEx. Note
that the lexicon, whose format is compatible with MPEG-7, is dependent on the sum-
marization application, and can be modified, imported, and saved using the
VideoAnnEx.
In our Video Personalization and Summarization System, a relevance score is
automatically assigned to the video based on the confidence value of the classifica-
tion. For our system, the annotation process generates a relevance score for the whole video sequence and for each attribute, based on the probability of that attribute for the corresponding video unit. After these steps, users can manually correct
the annotation as well as the scene boundaries. All these results are then saved as
an MPEG-7 XML file.
4.4. VideoAL—MPEG-7 video automatic labeling tool
Seven modules were included in the whole system: Shot Segmentation, Region Segmentation, Annotation, Feature Extraction, Model Learning, Classification, and MPEG-7 XML Rendering. To learn concept models, in the first step, shot boundary detection is performed on the training video set. Semantic labels are then associated with each shot or region using the VideoAnnEx module. A feature extraction mod-
ule extracts visual features from shots in different spatial-temporal granularity. Fi-
nally, a concept-learning module builds models for anchor concepts, e.g.,
outdoors, indoors, sky, snow, car, flag, etc. (see Fig. 4).
The first three modules of the automatic semantic labeling process are the same as
those of the training process. After features are extracted, a classification module
tests the relevance of shots with the anchor concept models and results in a confidence value for each concept. These output concept values are then described using
the MPEG-7 XML format. Only high confidence-value concepts are used to describe
the content. This process is executed automatically from MPEG-1 video stream to
MPEG-7 XML output. A relevance score is automatically assigned to the video shot
based on the confidence value of classification.
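The confidence-thresholding and XML rendering steps can be sketched as follows. The element names are simplified stand-ins for the MPEG-7 schema, and the per-concept scores are assumed to come from the pre-trained classifiers:

```python
import xml.etree.ElementTree as ET

def label_shot(shot_id, concept_scores, threshold=0.5):
    """Keep only the concepts whose classifier confidence meets the
    threshold, and render them as a simplified MPEG-7-style fragment."""
    kept = {c: s for c, s in concept_scores.items() if s >= threshold}
    shot = ET.Element("VideoSegment", id=shot_id)
    for concept, score in sorted(kept.items()):
        ann = ET.SubElement(shot, "KeywordAnnotation")
        ET.SubElement(ann, "Keyword").text = concept
        ET.SubElement(ann, "Relevance").text = f"{score:.2f}"
    return ET.tostring(shot, encoding="unicode")
```

For instance, a shot scored {outdoors: 0.9, sky: 0.7, snow: 0.2} would be described only by the high-confidence concepts outdoors and sky.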
The system extracts two sets of visual features for each video shot. These two sets are applied by two different modeling procedures. The first set includes: (1) color histogram (YCbCr, 8×3×3 dimensions), (2) auto-correlograms (YCbCr, 8×3×3 dimensions), (3) edge orientation histogram (32 dimensions), (4) Dudani's moment invariants (6 dimensions), (5) normalized width and height of the bounding box (2 dimensions), and (6) co-occurrence texture (48 dimensions). These visual features are all extracted from the keyframe. The other set of visual features includes: (1) color histogram (RGB, 8×8×8 dimensions), (2) color moments (7 dimensions), (3) coarseness (1 dimension), (4) contrast (1 dimension), (5) directionality (1 dimension), and (6) motion vector histogram (6 dimensions). The first five features are extracted from the keyframe. The motion features are extracted from every I and P frame of the shot using the motion estimation method described in the region segmentation module. The VideoAL system uses support vector machines (SVM) as the classification method and ensemble fusion for classification. A detailed description of VideoAL is given in Lin et al. (2003).

Fig. 4. System overview of VideoAL: (A) concept training process and model-building modules; (B) detection process and labeling modules.
5. Media middleware
The media middleware consists of the personalization engine and the adaptation
engine, as illustrated in the system overview of Fig. 2. In the personalization engine, the user query and usage environment are matched with the media descriptions and
rights expressions to generate the personalized content. The adaptation engine deter-
mines the optimal variation (i.e., format, size, rate, and quality) of the content for the
user in accordance with the adaptability declarations and the inherent usage
environment.
5.1. Personalization engine
The objective of our system is to show a shortened video summary that maintains as much semantic content as possible within the desired time constraint. The VideoSue engine performs this process by three different matching methods, described next.
5.1.1. VideoSue—video summarization on usage environment
VideoSue stands for Video Summarization on Usage Environment. This engine takes MPEG-7 metadata descriptions from our content sources, along with the MPEG-7/MPEG-21 user preference declarations and the user's time constraint, and outputs an optimized set of selected video segments, which generates the desired personalized video summary (Tseng et al., 2002), as illustrated in Fig. 5. Using shot segments as the basic video unit, there are multiple methods of video summarization based on spatial and temporal compression of the original video sequence. In our work, we focus on the insertion or deletion of each video shot depending on user preference.
Each video shot is either included or excluded from the final video summary.
In each shot, MPEG-7 metadata describes the semantic content and the corresponding scores. Assume there are a total of $N$ attribute categories. Let $\vec{P} = [p_1, p_2, \ldots, p_N]^T$ be the user preference vector, where $p_i$ denotes the preference weighting for attribute $i$, $1 \le i \le N$. Assume there are a total of $M$ shots. Let $\vec{S} = [s_1, s_2, \ldots, s_M]^T$ be the shot segments that comprise the original video sequence, where $s_i$ denotes shot number $i$, $1 \le i \le M$. Subsequently, the attribute score $a_{i,j}$ is defined as the relevance of attribute $j$ in shot $i$, $1 \le i \le M$ and $1 \le j \le N$. It then follows that the weighted importance $w_i$ of shot $i$ given the user preference $\vec{P}$ is calculated as the product of the attribute matrix $A$ and the user preference vector $\vec{P}$:
Fig. 5. Overview of video summarization engine. Each video is segmented into shots that are annotated with MPEG-7 descriptions. Using the user's preferences and profiles, the optimally matched set of video shots is selected.
$$\vec{W} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & \cdots & \cdots & \cdots \\ \cdots & \cdots & a_{i,j} & \cdots \\ a_{M,1} & \cdots & \cdots & a_{M,N} \end{bmatrix} \cdot \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_N \end{bmatrix},$$
where $w_i$ specifies the relative weighted importance of shot $i$ with respect to the other shots. Assume shot $s_i$ spans a duration $t_i$, $1 \le i \le M$. Consequently, shot $s_i$ is included in the summary if its importance weighting exceeds some threshold, $w_i > \theta$, and excluded otherwise. The threshold $\theta$ is chosen such that the sum of the selected shot durations $t_i$ is less than the user-specified time constraint. In other words, each shot is ranked according to its weighted importance and either included in or excluded from the final personalized video summary according to the time constraint. Furthermore, the VideoSue engine generates the optimal set of selected video shots for a single video as well as across multiple video sources.
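The selection just described can be sketched in a few lines. The function name and the greedy strategy, which simply takes shots in decreasing w_i order until the time budget is exhausted, are our own simplifications of the threshold search:

```python
import numpy as np

def summarize(A, p, durations, time_limit):
    """Rank shots by weighted importance w = A @ p, then greedily keep the
    highest-ranked shots whose durations still fit the time constraint."""
    w = A @ p                       # weighted importance of each shot
    order = np.argsort(-w)          # shots ranked by descending importance
    selected, total = [], 0.0
    for i in order:
        if total + durations[i] <= time_limit:
            selected.append(int(i))
            total += durations[i]
    return sorted(selected)         # restore temporal order for playback
```

Taking shots strictly by rank until the budget is filled approximates choosing the threshold θ described above.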
5.1.2. Video summarization using visual semantic annotations and automatic speech
transcriptions
The objective of our personalization and summarization is to show a shortened video that captures the most relevant content within the desired time constraint. We describe another summarization algorithm that uses visual annotations and speech transcripts to define the summarization segment units and to match user preferences, as presented in Tseng and Lin (2002). Metadata describing our content are available as annotated video shots from the annotation tools and as recognized speech words from automatic speech recognition systems. Our video summary objective is to retrieve video shots with their corresponding coherent sentence segments.
Visual shot boundaries do not necessarily correspond to the semantic boundaries imposed by an audio sentence. Thus audio-visual alignment, abbreviated as AVA, is performed to determine the optimal shot-to-sentence aligned boundaries. The resulting AVA unit captures the shot-to-sentence relationships. For instance, many short shots may be required to convey one complete concept delivered by a long sentence. Another example is when one video shot is too long and needs to be further segmented.
Fig. 7 shows the three shot-to-sentence relationships, where V stands for video
and A for audio. For case (1) Short Shots and Long Sentence, there are many short
shots to convey one complete concept delivered by the long sentence. Thus, the sum-
mary AVA segment selects the span of the long sentence. For case (2) Long Shot and
Short Sentences, the shot is too long and can be further segmented. Consequently,
the summary AVA segment selects only the required span of one sentence. For case
(3) Similar Shot and Sentence Lengths, the summary AVA unit is clearly defined as the continuous camera-captured shot.
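The three shot-to-sentence cases can be folded into one alignment rule: take the span of the sentence, unless the overlapping shot span and the sentence are of similar length, in which case take the shot. The sketch below is our own simplified reading of AVA, with a hypothetical similarity threshold of 0.8:

```python
def ava_units(shots, sentences, similar=0.8):
    """Derive audio-visual aligned (AVA) units from shot and sentence
    intervals, each given as a (start, end) pair in seconds."""
    units = []
    for sent_start, sent_end in sentences:
        # shots that overlap this sentence in time
        overlapping = [(s, e) for s, e in shots if s < sent_end and e > sent_start]
        if not overlapping:
            continue
        shot_start = min(s for s, _ in overlapping)
        shot_end = max(e for _, e in overlapping)
        ratio = min(shot_end - shot_start, sent_end - sent_start) / \
                max(shot_end - shot_start, sent_end - sent_start)
        if ratio >= similar:
            units.append((shot_start, shot_end))   # case 3: similar lengths, keep the shot
        else:
            units.append((sent_start, sent_end))   # cases 1 and 2: keep the sentence span
    return units
```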
Our personalized video summary is composed of video shots and coherent sentence segments. Using the MPEG-7 descriptions from the shot-based video annotations and the aggregated sentence transcripts, the personalization engine matches these metadata against the user preferences and selects the optimal set of summarization AVA units. Subsequently, each AVA unit is excluded, included, or partially included in the final personalized video summary. The final selection depends on the ranked relevance of the user preference correlation against the AVA media descriptions, subject to the time constraint as presented earlier.
5.1.3. Video summarization using hierarchical context clustering
So far our video summaries are composed of a set of independent shots or AVA units that optimally match the user preferences while limited by the total time constraint. The resulting video summaries do not take the relative temporal segment selections into consideration. Furthermore, the selected summary segments should convey some coherence and continuity, especially for neighboring segments that may be part of a more complete story. As a result, selecting the optimal summary segments also depends on the relative context. In this section, we examine a new summarization algorithm in which a hierarchical context structure is imposed on the video (Tseng and Smith, 2003).
Well-structured video content is hierarchical in nature. This holds for most video sequences that have defined program structures and general production formats. For instance, news videos follow a program structure using categories such as international politics, local news, finance, and sports. The news categories are then broken up into specific news stories. Furthermore, the news stories can be decomposed into AVA units, and finally into individual camera shots, as illustrated in Fig. 6. Similarly, user interests can be specified at any of these granularities or levels. As a result, we base our video summarization algorithm on the hierarchical levels of the content metadata.
Starting with the basic shot segments of a video, let the value $v(s_i)$ associated with shot $s_i$ denote the derived user value of viewing shot $s_i$, and let $d(s_i)$ denote its duration. Determining the optimal set of shots $s_j$ for a video summary then corresponds to solving the following problem:
Fig. 6. Hierarchical decomposition of a video sequence into category segments, story segments, AVA
units, and shots.
Fig. 7. Video shot and audio sentence alignment example.
$$\text{maximize the sum of the shot values: } \sum_j v(s_j),$$

$$\text{subject to the sum of the shot durations: } \sum_j d(s_j) \le D.$$
As demonstrated by Merialdo et al. (1999), the above optimization is an instance of the Knapsack problem, for which a heuristic solution is derived using the value density:

$$\text{value density} = v(s_i)/d(s_i).$$

When shot values are similar, the value density assigns a higher score to shorter shots, which are thus more likely to be selected for the resulting video summary. The value density is a good approximation for solving the NP-complete Knapsack problem. However, it does not reflect coherence between neighboring shots or among shots belonging to the same story.
Let us define a hierarchy of video segment relationships, such as shots, stories, and categories. For illustration, multiple shots can belong to one story, and many stories are grouped into a category. Shots reside in level 1, stories in level 2, and categories in level 3. Let $s(i,l)$ denote the video segment at level $l \in [1, L]$ in which shot $s_i$ resides. It follows that $s(i, l{=}1)$ corresponds to shot $s_i$, $s(i, l{=}2)$ corresponds to the story covering shot $s_i$, and $s(i, l{=}3)$ corresponds to the category embracing shot $s_i$. Similarly, $d(i,l)$ denotes the duration of video segment $s(i,l)$. Subsequently, we can compute the relative hierarchical contribution as follows:
$$\text{coherence value for shot } s_i = \sum_{l=1}^{L} B(l) \sum_{j \in (i,l)} v(j),$$
where $B(l)$ is the coherence function dependent on the level $l$. In our case, we used a decaying level coherence $B(l) = (0.5)^{l-1}$. Thus the coherence value aggregates and spreads the value of the shots over the neighboring hierarchical levels. The coherence value reflects the shot relationships required for a more coherent video summary. To reflect the duration variable in the value score computation, we finally introduce the following:
$$\text{coherence density for shot } s_i = \sum_{l=1}^{L} B(l) \sum_{j \in (i,l)} v(j)/d(i,l).$$
The coherence density captures both the coherence relationship and the value density. Fig. 8 illustrates an example that demonstrates the ranked selection of shots depending on the value density, the coherence value, and the coherence density. In this example, there are 10 shots. Shots 1–8 have a duration of 1 min, $d(1\text{–}8) = 1$, and shots 9 and 10 have a duration of 1.5 min, $d(9) = d(10) = 1.5$. Shots 2, 5, 9, and 10 have a user value of 1.0, $v(2) = v(5) = v(9) = v(10) = 1$, and the other shots have a value of 0. In addition, shots 1 and 2 compose story I; shots 3, 4, and 5 compose story II; shots 7 and 8 compose story III; and shots 9 and 10 compose story IV. Furthermore, stories I and II form category I, and stories III and IV form category II. Fig. 8 shows these parameters in the top subplots.
In Fig. 8 (left), the value density is computed for each shot, and the scores are shown in the second subplot. The value density is then computed against the story durations, and these scores are aggregated with the shot value densities in the third subplot. Next, the value density is computed at the category level and aggregated with the story and shot densities to generate the fourth subplot. As a
Fig. 8. An example of aggregated relevance scores for hierarchical shots, stories, and categories, as they are computed for (left) value density, (middle) coherence value, and (right) coherence density.
result, the ranking of the segments gives the following order: shots 2, (5, 9, 10), 1, (3, 4), (7, 8), and 6. Thus the highly valued shot 2 is the first shot selected.

Similarly, in Fig. 8 (middle), the coherence value is computed and aggregated for shots (second subplot), stories (third subplot), and categories (fourth subplot). The resulting ranking generates the following segment order: shots (9, 10), (2, 5), (1, 3, 4), (7, 8), and 6. Subsequently, the highly coherent shots 9 and 10 are bumped to the top of the list.

Finally, in Fig. 8 (right), the coherence density is computed and aggregated for shots (second subplot), stories (third subplot), and categories (fourth subplot). The resulting ranking generates the following segment order: shots 2, (9, 10), 5, 1, (3, 4), (7, 8), and 6. Here, shot 2 is still the first shot selected, but shots 9 and 10 are ranked closely behind. In conclusion, the coherence density strikes an effective balance between the individual shot value contributions and the hierarchical coherence of neighboring shots.
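The coherence value and coherence density formulas can be computed directly once the segment membership at each level is known. The sketch below reconstructs the Fig. 8 example; the hierarchy membership of shot 6 is not stated in the text, so it is left outside the story and category levels, and the exact ordering among shots 5, 9, and 10 depends on how durations enter at each level:

```python
# Reconstruction of the Fig. 8 example: per-shot values and durations (min)
values = {i: 0.0 for i in range(1, 11)}
values.update({2: 1.0, 5: 1.0, 9: 1.0, 10: 1.0})
durations = {i: 1.0 for i in range(1, 9)}
durations.update({9: 1.5, 10: 1.5})

hierarchy = {                                 # level -> segments (groups of shots)
    1: [[i] for i in range(1, 11)],           # shots
    2: [[1, 2], [3, 4, 5], [7, 8], [9, 10]],  # stories I-IV
    3: [[1, 2, 3, 4, 5], [7, 8, 9, 10]],      # categories I-II
}

def B(l):
    return 0.5 ** (l - 1)                     # decaying level coherence

def segment_of(shot, level):
    """Return the level-`level` segment containing `shot`, or None."""
    return next((g for g in hierarchy[level] if shot in g), None)

def coherence_value(shot):
    total = 0.0
    for l in hierarchy:
        seg = segment_of(shot, l)
        if seg:
            total += B(l) * sum(values[j] for j in seg)
    return total

def coherence_density(shot):
    total = 0.0
    for l in hierarchy:
        seg = segment_of(shot, l)
        if seg:
            total += B(l) * sum(values[j] for j in seg) / sum(durations[j] for j in seg)
    return total
```

With this reconstruction, shots 9 and 10 top the coherence-value ordering while shot 2 tops the coherence-density ordering, matching the headline selections described above.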
5.2. Adaptation engine
The objective of the adaptation engine is to perform the optimal set of transformations on the selected content in accordance with the adaptability declarations and the inherent usage environment. The adaptation engine must be equipped to perform transcoding, filtering, and scaling such that the user can play the final adapted personalized content. We have implemented two adaptation engines, called VideoEd and the Universal Tuner.
5.2.1. VideoEd—real-time video editing and composition tool for content adaptation
VideoEd is a novel MPEG system-layer compressed-domain editing and composition technique used to facilitate the delivery and integration of multiple segments of MPEG files residing in remote databases (Lin et al., 2002). Various multimedia applications, including retrieval and summarization, split MPEG files into small segments along shot boundaries and store them separately. This traditional method requires extra management and storage overhead, provides only fixed segmentations, and may not play smoothly. To solve this problem, our adaptation engine directly extracts the desired video and audio information from the original MPEG sources and combines them to generate a single MPEG file. Manipulated wholly in the system bitstream domain, this method does not require decoding, re-encoding, or re-synchronization of the audio and video data. Thus, it operates in real time and provides great flexibility. The composite MPEG file can be transmitted and displayed through general Web interfaces.
Fig. 9 shows the flowchart of the editing tool, VideoEd. The tool takes three kinds of input: MPEG files, Frame Map files, and a Retrieval List. The first two are stored in a database. Frame Map files, which record the accessible pack-header positions of the GOPs, indicate the randomly accessible positions in an MPEG system stream. The Retrieval List is the XML file generated by the personalization engine.
The highlighted area in Fig. 9 shows the case of concatenating multiple video-audio segments from an MPEG file to generate a new one. The first step is to copy and
Fig. 9. Basic structure of the VideoEd compressed-domain editing and composition tool.
transfer the first one or two packs to a new file. The number of copied packs depends on whether the global information of the video and audio is stored separately and on the number of packets in a pack. Next, we need to extract the video and audio sequence headers. In an MPEG-1 bitstream, there can be more than one video sequence header (VSH). In most cases there is either one VSH for the whole video sequence or multiple VSHs, one before each GOP. Note that a VSH has to be placed immediately before a GOP. Audio sequence headers have the same properties as those of video.
The audio and video data are copied and pasted at the pack level. According to the Retrieval List, we then choose the accessible positions indicated by the Frame Map files. Note that the resolution of segmentation has to be at the GOP level. In other words, video and audio can only be cut and pasted at a resolution of roughly 0.5 s. After this step, we need to clean up the bits that belong to the previous or next GOP. If there are multiple packets in a pack, we delete the unnecessary packets. After cleaning up, we insert the video sequence header at the beginning of the first video segment. A similar cleaning process is performed for the audio packets.
The final step of VideoEd is the manipulation of the time codes, which must be done in the system pack and packet layers. The new sequence is regarded as valid regardless of the time codes at the GOP headers. To concatenate two clips, we calculate a temporal offset as the difference between the first SCR of the new clip and the last SCR of the previous clip, and subtract it from the SCRs of all the packs. We change the PTS and DTS of all the packets in the second clip using a similar strategy.
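A toy sketch of this time-code rebasing follows. Real MPEG-1 system streams encode SCR, PTS, and DTS as bit fields inside the pack and packet headers; the dict-based representation and field names here are purely illustrative:

```python
def rebase_timestamps(packs, prev_clip_end_scr):
    """Shift the SCR of every pack, and the PTS/DTS of every packet, in an
    appended clip so its clock continues from the previous clip's last SCR."""
    offset = packs[0]["scr"] - prev_clip_end_scr
    for pack in packs:
        pack["scr"] -= offset
        for pkt in pack.get("packets", []):
            for field in ("pts", "dts"):
                if pkt.get(field) is not None:
                    pkt[field] -= offset
    return packs
```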
5.2.2. Universal Tuner—format and scale transcoder for pervasive devices
The other component in our adaptation is the Universal Tuner, which serves as an adaptive transcoder for pervasive devices (Han et al., 2001). In the Universal Tuner, the complexity of the multimedia compression and decompression algorithms is adaptively partitioned between the encoder and the decoder. A mobile client selectively disables or re-enables stages of the algorithm to adapt to the device's effective processing capability. This variable-complexity strategy of selectively disabling modules supports graceful degradation of the complexity of multimedia coding and decoding when a mobile client enters its low-power mode, i.e., when the clock frequency of its next-generation low-power CPU has been scaled down to conserve power. The input to the transcoder is a stream of video frames from an MPEG-1/2/4 file. The video frames are first decoded and resampled to either an 80 × 80 or a 160 × 160 image. Then, depending on whether the PDA device is black-and-white or supports 256 colors (as in the Palm IIIc), the RGB frames are either dithered using halftoning or mapped to 256 colors using a shot-based optimal color map. In the color mode, we added a shot boundary detection function in order to update the color codebook for each new shot. After this step, the transcoder can selectively enable an entropy coding process and a frame differencing process to further reduce the bit rate.
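The stage-selection logic can be sketched as a per-client adaptation plan. The device fields, thresholds, and option names below are illustrative assumptions, not the published Universal Tuner configuration:

```python
def adaptation_plan(device):
    """Decide which Universal Tuner stages to enable for a given client."""
    headroom = device.get("cpu_headroom", 0.0)   # fraction of CPU still available
    return {
        # drop to the small frame size when the CPU is in low-power mode
        "resolution": (80, 80) if device.get("low_power_mode") else (160, 160),
        # mono PDAs get halftone dithering; color PDAs a shot-based 256-color map
        "color": "halftone" if device["display"] == "mono" else "shot_palette_256",
        # optional stages that can be disabled to cut decoding complexity
        "entropy_coding": headroom > 0.5,
        "frame_differencing": headroom > 0.25,
    }
```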
6. Personalization applications and user clients
The user client allows the user to retrieve specific content via a user query, provide critical persistent data through the usage environment, and receive the personalized content on the user's display client. The usage environment holds profiles about the user, device, network, delivery, and other environments. These usage environment descriptions are sent to the personalization engine of the media middleware, and may include a user profile (e.g., an English-speaking American residing in New York City), a device profile (e.g., a Palm IIIc with a limited color display, no audio capabilities, and limited memory capacity), and a transmission profile (e.g., Internet access through a 56K modem). They allow the middleware to dynamically perform the appropriate content adaptation on the requested media. Thus the delivered multimedia can be personalized for the usage environment.
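For illustration, such a description might be serialized roughly as follows. This is a simplified sketch in the spirit of MPEG-21 Digital Item Adaptation; the element and attribute names are our own and are not schema-exact:

```xml
<!-- Simplified usage environment description sent from client to middleware -->
<UsageEnvironment>
  <User language="en" location="New York City"/>
  <Terminal model="Palm IIIc" display="256-color" audio="none" memoryKB="8192"/>
  <Network type="modem" maxBitrateKbps="56"/>
</UsageEnvironment>
```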
The user submits a request for content via a user query. The query can take the form of topic preferences, search keywords, media samples, and/or time constraints. The user query then initiates the retrieval of customized content by the media middleware. Subsequently, the personalized multimedia is delivered from the adaptation engine of the middleware to the user and rendered on the display client in accordance with the usage environment and the user query. If the query is in the form of free keywords, however, our current implementations can retrieve only the keywords that match the (manual or automatic) annotations. In the rest of this section, we present two personalization applications.
6.1. User clients on PCs using IBM Websphere Portal Server
The IBM Websphere Portal Server (WPS) product provides a secure, single point of access to diverse information and applications, personalized to the needs of its users. WPS uses portals to present users with personalized content according to authenticated user profiles. WPS can also be deployed on many categories of portals that allow access from a wide variety of desktops and mobile devices. This framework provides us with an open, flexible, and scalable infrastructure for creating and deploying our rich media personalization system with usage environment descriptions. Within WPS, we have developed the personalization and summarization portal.
When a user establishes an account with WPS, a user profile is created, which can be modified later. In our portal, the preference interests of the user are gathered, including the topic category and the default viewing time. Fig. 10 illustrates the user client portals for three different topic categories: News, Entertainment, and Education. The appropriate portal page according to the user profile is automatically presented when the user logs on to the portal; this illustrates the use of the persistent usage environment within WPS. Assume our user has stored the desired topic category as Entertainment. Then Fig. 11 (left) shows the corresponding portal interface of the Late Night Show, which is ranked in the Entertainment category.
Fig. 10. User client portals corresponding to three different usage environment profiles for: (1) news (left),
(2) entertainment (middle), and (3) education (right).
Fig. 11. The usage example and screen shots of the personalized video summarization for user clients on
PC.
After our user enters the portal page, she can request a personalized video summary according to her current interests. There are two preferences and a time limit for her selection. For the Late Night Show, the preference choices range from ''David_Letterman'' and ''Paul_Shaffer'' to ''Monologue,'' ''Guest_Interview,'' and ''Top_10.'' Our user submits the query preferences (1) ''Guest_Introduction'' and (2) ''David_Letterman,'' along with a time constraint of 120 s. As a result, the VideoSue summarization and personalization engine retrieves the optimal set of highly relevant video segments, which include David Letterman introducing many guests from three video sources. Subsequently, the VideoEd composition tool gathers this set of selected video clips and dynamically generates one personalized video movie. Fig. 11 (right) depicts four screen shots of guests as they are introduced.
6.2. User clients on personal digital assistant (PDA) devices
Using the Universal Tuner adaptation engine, we can transcode a summarized video and transmit it to pervasive devices. The client can access personalized video in one of three modes: (a) video-on-demand with interactive hyperlinks, (b) summarized video based on preference topics and a time constraint, and (c) summarized video based on query keywords and a time constraint. Fig. 12 illustrates the user client in mode (a) on the left, mode (b) in the middle, and mode (c) on the right. A user can first click on the Get Channels button to get a list of hyperlinks according to the stored user profile. This list is then displayed to the left of the video playback window. He can then select a video to view by clicking on the hyperlinks. In another scenario, he may want to see a summarized video based on the preference choices that were sent from the middleware when he selected a video. Then, based on his preferences and time constraint, he receives the summarized video that best matches his preferences and profile. In addition to the preferences recommended by the middleware, users can also key in keywords and let the middleware find the best matches and generate the personalized video summary.
Fig. 12. The user client interface on the palm OS PDA for three personalized video scenarios: (A) video-
on-demand with interactive hyperlinks, (B) summarized video based on preference topics and time
constraint, and (C) summarized video based on query keywords and time constraint.
7. Conclusion
A video personalization and summarization system is presented that matches media descriptions with usage environments in a server–middleware–client framework in order to deliver personalized media content to users. The three-tier architecture provides a standards-compliant infrastructure using our tools and engines to select, adapt, and deliver personalized video summaries to users effectively. The VideoAnnEx MPEG-7 Video Annotation Tool and the VideoAL Automatic Labeling Tool allow annotators to describe video segments using semantic concepts. The VideoSue Summarization on Usage Environment engine determines the optimal selection of media contents according to the user preferences and time constraint. The VideoEd Editing and Composition Tool and the Universal Tuner gather the selected contents and generate one personalized video for the user. Our personalization and summarization systems are demonstrated on top of the IBM Websphere Portal Server (WPS) and for PDA devices. In the future, we will add functionality that allows users to input any query keywords and let VideoSue match them with the existing annotations according to their semantic distance. We will also test the performance of other possible fusion methods, in addition to the weighting method that is currently used. Furthermore, we will work on increasing the coverage and accuracy of the concept models in VideoAL, and conduct more experiments on subjective user satisfaction.
References
Available from: <http://www.ibm.com>.
Available from: <http://www.nttdocomo.co.jp>.
Available from: <http://www.virage.com>.
Aizawa, K., Shijima, K.I., Shiina, M., 2001. Summarizing wearable video. IEEE ICIP, Thessaloniki,
Greece, October 2001.
Amir, A., Ponceleon, D., Blanchard, B., Petkovic, D., Srinivasan, S., Cohen, G., 2000. Using audio time
scale modification for video browsing. In: Hawaii International Conference on System Sciences,
HICSS-33, Maui, January.
Butler, M.H., 2001. Implementing content negotiation using CC/PP and WAP UAProf, Technical Report
HPL-2001-190, June 2001. Available from: <http://www.hpl.hp.com/techreports/2001/
HTTPCCPP_2001-190.html>.
Gong, Y., Liu, X., 2001. Summarizing video by minimizing visual content redundancies. IEEE ICME,
Tokyo, August 2001.
Han, R., Lin, C.-Y., Smith, J.R., Tseng, B.L., Ha, V., 2001. Universal tuner: a video streaming system for
CPU/power-constrained mobile devices. ACM Multimedia 2001, Ottawa, Canada.
ISO/IEC JTC 1/SC 29/WG 11/N 4242, Text of 15938-5 FDIS, Information Technology—Multimedia
Content Description Interface—Part 5 Multimedia Description Schemes, Final Document Interna-
tional Standard (FDIS) edition, October 2001.
ISO/IEC JTC1/SC29/WG11/M8321, MPEG-7 Tools for MPEG-21 Digital Item Adaptation, Fairfax, VA,
May 2002.
ISO/IEC JTC1/SC29/WG11/N5354, MPEG-21 Digital Item Adaptation, Awaji Island, Japan, December
2002.
Lin, C.-Y., Tseng, B.L., Smith, J.R., 2002. Universal MPEG content access using compressed-domain system stream editing techniques. In: IEEE International Conference on Multimedia and Expo, Switzerland, August.
Lin, C.-Y., Tseng, B.L., Smith, J.R., 2003. VideoAnnEx: IBM MPEG-7 annotation tool. In: IEEE
International Conference on Multimedia and Expo, Baltimore.
Lin, C.-Y., Tseng, B.L., Naphade, M., Natsev, A., Smith, J.R., 2003. VideoAL: a novel end-to-end
MPEG-7 video automatic labeling system. In: IEEE International Conference on Image Processing,
Barcelona, September.
Ma, Y.F., Zhang, H.J., 2002. A model of motion attention for video skimming. International Conference
on Image Processing, Rochester, New York, September.
Manjunath, B.S., Salembier, P., Sikora, T., 2002. Introduction to MPEG-7: Multimedia Content Description Interface. Wiley, New York.
Merialdo, B., Lee, K.T., Luparello, D., Roudaire, J., 1999. Automatic construction of personalized TV
news programs. In: ACM Multimedia 1999, Orlando, FL, September.
Sundaram, H., Chang, S.F., 2000. Determining computable scenes in films and their structures using
audio-visual memory models. ACM Multimedia 2000, Los Angeles, CA.
Tseng, B.L., Lin, C.-Y., Smith, J.R., 2002. Video summarization and personalization for pervasive mobile
devices. In: SPIE Electronic Imaging 2002—Storage and Retrieval for Media Databases, San Jose,
January.
Tseng, B.L., Lin, C.-Y., 2002. Personalized video summary using visual semantic annotations and
automatic speech transcriptions. In: IEEE Multimedia Signal Processing MMSP 2002, St. Thomas,
December.
Tseng, B.L., Smith, J.R., 2003. Hierarchical video summarization based on context clustering. In:
International Symposium on ITCom 2003—Internet Multimedia Management Systems, Orlando, FL.
Yeung, M., Yeo, B., Wolf, W., Liu, B., 1995. Video browsing using clustering and scene transitions on compressed sequences. In: Proceedings of the SPIE on Multimedia Computing and Networking, vol. 2417.