www.elsevier.com/locate/jvci
J. Vis. Commun. Image R. 15 (2004) 370–392
Video personalization and summarization system for usage environment
Belle L. Tseng*, Ching-Yung Lin, John R. Smith
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA
Received 5 June 2003; accepted 5 April 2004
Abstract
A video personalization and summarization system is designed and implemented incorpo-
rating usage environment to dynamically generate a personalized video summary. The person-
alization system adopts the three-tier server–middleware–client architecture in order to select,
adapt, and deliver rich media content to the user. The server stores the content sources along
with their corresponding MPEG-7 metadata descriptions. Our semantic metadata is provided
through the use of the VideoAnnEx MPEG-7 Video Annotation Tool. When the user initiates
a request for content, the client communicates the MPEG-21 usage environment description
along with the user query to the middleware. The middleware is powered by the personaliza-
tion engine and the content adaptation engine. Our personalization engine includes the Video-
Sue Summarization on Usage Environment engine that selects the optimal set of desired
contents according to user preferences. Afterwards, the adaptation engine performs the
required transformations and compositions of the selected contents for the specific usage envi-
ronment using our VideoEd Editing and Composition Tool. Finally, two personalization and
summarization systems are demonstrated for the IBM Websphere Portal Server and for per-
vasive PDA devices.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Video personalization; Summarization; Adaptation; Usage environment; MPEG-7; MPEG-21;
Annotation; Rich media description; Digital item adaptation; Rights expression
1047-3203/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.jvcir.2004.04.011
* Corresponding author.
E-mail addresses: [email protected] (B.L. Tseng), [email protected] (C.-Y. Lin), [email protected] (J.R. Smith).
1. Introduction
With the growing amount of multimedia content, people become more willing
to view personalized multimedia based on their usage environments. When peo-
ple use their pervasive devices, they generally restrict their viewing time on the limited displays and minimize the amount of interaction and navigation to get
to the content. When they browse video on the Internet, they may want to
get only the videos that match their preferences. Because of the existence of het-
erogeneous user clients and data sources, it is a real challenge to implement a
universally compliant system that fits various usage environments. Most existing
video summarization tools address their applications in proprietary environments. Some systems display summarized videos using a number of key frames
for each detected scene shot to generate a storyboard (Yeung et al., 1995). Merialdo et al. (1999) maximize story content value to generate personalized TV
news programs based on user preference and time constraint. Sundaram and
Chang (2000) incorporate film structure rules to detect coherent summarization
segments. Gong and Liu (2001) use singular value decomposition (SVD) of the attribute matrix to reduce the redundancy of video segments and thus generate video
summaries. In addition to the systems based on audio-visual information, some
researchers have proposed methods to detect semantically important events based on other resources. For instance, Aizawa et al. (2001) use brain waves to detect exciting moments of the subjects. Ma and Zhang (2002) generate video summaries by using user attention models based on multiple sensory perceptions. In industry
sectors, companies such as NTT DoCoMo and Virage have implemented com-
mercialized video summarization systems. Japanese Wireless carrier NTT DoCo-
Mo streams assorted video summaries to its I-mode cellular phones over the 3G
network. Virage provides video content management and creates personalized vi-
deo summaries including NHL hockey highlights, NASCAR racing and ABC
News.

This paper addresses the issues of designing a video personalization and summariza-
tion system in heterogeneous usage environments and provides a tutorial introduc-
tion on their associated issues in MPEG-7 and MPEG-21. The server maintains
the content sources, the MPEG-7 metadata descriptions, the MPEG-21 rights
expressions, and content adaptability declarations. The client communicates the
MPEG-7 user preference, MPEG-21 usage environments, and user query so as to re-
trieve and display the personalized content. The middleware is powered by the per-
sonalization engine and the adaptation engine. The personalization and summarization system adopts the three-tier server–middleware–client architecture
in order to select, adapt, and deliver rich media content to the user.
This paper is organized as follows. In Section 2, we describe the personalization
and summarization framework that is currently evolving in MPEG-21 and what
has been included in MPEG-7. In Section 3, we describe our video personalization
and summarization system, which includes the user client, the personalization and
adaptation media middleware, and the database server. Details of these three tiers
are described in Sections 4–6.
2. Digital item adaptation using MPEG-7 and MPEG-21
To personalize the delivery and consumption of multimedia content for individual users, a framework coined Digital Item Adaptation by the ISO/IEC MPEG-21 group has been identified as essential for personalization. Adaptation aims at providing the appropriate content types to the different terminals via the preferred networks. MPEG-21 Digital Item Adaptation provides the tools to support re-
source adaptation of content, descriptor adaptation through metadata, and quality
of service management (Digital Item Adaptation, 2002a,b).
Fig. 1 illustrates a fundamental structure for Digital Item Adaptation. The content
sources may include text, audio, images, and/or video. Associated with each source is
the corresponding digital item declaration that identifies and describes the content.
MPEG-21 Digital Item Declaration (DID) provides standard description schemes that identify the content and include components for resource adaptation, rights expression, and adaptation rights. Under MPEG-21 DID, there is a standard scheme
for content adaptability called Digital Item Resource Adaptation. While MPEG-7 is
used to describe content using features, semantics, structures, and models (FDIS,
Information Technology, 2001; Manjunath et al., 2002), it also includes adaptation
hints, covered under the MPEG-7 Media Resource Requirements, which overlap with MPEG-21. In addition, MPEG-21 has a similar
Fig. 1. Block diagram of digital item adaptation using MPEG-7 and MPEG-21. The personalization and
adaptation engines match the content sources and corresponding digital item declarations with the user
query and usage environment through a controlled term list in order to: (1) optimally select the set of
personal content and (2) effectively adapt that dynamic content for the user.
requirement for content adaptability called MPEG-21 Media Resource Adaptability,
and thus overlaps with MPEG-7 as illustrated in Fig. 1.
Rights expression is another subcomponent of the MPEG-21 Digital Item Declaration and describes the rights associated with multimedia content. These include the rights and permissions for an individual or designated group to view, modify, copy, or distribute the content. Among the expressions relevant to personalization are adaptation rights. For example, the owner of an image may allow cropping of the image to fit the desired display size, but not scaling.
From the perspective of the user, the usage environment defines the user profiles,
terminal properties, network characteristics, and other user environments. MPEG-21 Usage Environment is in the process of defining specific requirements to describe such conditions. The usage environment also includes user preferences, which can be partially or fully described by the MPEG-7 User Preferences, adopted as a subcomponent of the MPEG-21 Usage Environment (Digital Item Adaptation, 2002a). Furthermore, the user can request specific content in a special application not defined by the standard. This request is represented as the user query in Fig. 1.
To match the user's usage environment with the content's digital item declaration, the adaptation engine is introduced to personally select the desired content and optimally adapt that content in accordance with the descriptions. Specifically, users specify their requests for content through the user query and usage environment. On the other hand, the digital item declaration encompasses both the rights expression and the content adaptability of the available content. To assist the matching
process, a common language can be provided through the MPEG-7 Controlled Term
List. This list allows specific domain applications to define their taxonomy, their
relationships, and their importance. In summary, the adaptation engine takes the
user query, the usage environment, the digital item declaration, and the controlled
term list to adapt the content sources to generate the personalized content.
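As a rough illustration of this matching step, the following sketch uses simplified dictionary stand-ins for the user query, the usage environment, the digital item declarations, and the controlled term list (all field names are hypothetical, not the standardized syntax). It expands terms through the controlled vocabulary, filters out items the terminal cannot render, and ranks the rest:

```python
def personalize(items, query_terms, usage_env, controlled_terms):
    """Rank digital items against a user query and usage environment.

    controlled_terms maps a term to related terms, playing the role of
    the MPEG-7 Controlled Term List in the matching process.
    """
    def expand(terms):
        out = set(terms)
        for t in terms:
            out.update(controlled_terms.get(t, []))
        return out

    wanted = expand(query_terms)
    results = []
    for item in items:
        if item["format"] not in usage_env["supported_formats"]:
            continue  # the terminal cannot decode this variation
        score = len(expand(item["terms"]) & wanted)
        if score > 0:
            results.append((score, item["id"]))
    # Highest-scoring items first.
    return [item_id for _, item_id in sorted(results, reverse=True)]
```

For example, a query for "sports" on an MPEG-1-only terminal would rank an item labeled "sports" above one labeled only with the related term "hockey", and drop MPEG-4 items entirely.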
In the rest of this section, we describe more details of the MPEG-7 and MPEG-21
components in digital item adaptation for usage environment, media description, rights expression, and content adaptability.
2.1. Media descriptions
Media descriptions identify and describe multimedia content at different abstraction levels, from low-level features to various interpretations of semantic concepts. MPEG-7 provides description schemes (DS) to describe content in XML to facilitate the search, indexing, and filtering of audio-visual data (FDIS, Information Technology, 2001; Manjunath et al., 2002). The DSs are designed to describe both the
audio-visual data management and the specific concepts and features. The data man-
agement descriptions include metadata about creation, production, usage, and man-
agement. The concepts and features metadata may include what the scene is about in
a video clip, what objects are present in the scene, who is talking in the conversation,
and what is the color distribution of an image.
MPEG-7 standardizes the description structure; however, technical challenges re-
main. The generation of these descriptions is not part of the MPEG-7 standard and
the technologies for generating them are variable, competitive, and in some cases non-existent. As such, we have developed an annotation tool that lets a human annotator capture the underlying semantic meaning of video clips. These annotators are assisted in their annotation task through the use of a finite vocabulary set. This set can be readily represented by the MPEG-7 Controlled Term List, which can be customized for different domains and applications. For example, in a news
program, the list can include the anchors' names, topic categories, and camera cap-
tures. Our annotation tool is described in Section 4.
2.2. Usage environment
The usage environment holds the profiles about the user, device, network, deliv-
ery, and other environments. This information is used to determine the optimal content selection and the most appropriate form for the user. Besides the MPEG standards, several others have been proposed. HTTP/1.1 uses the composite capabilities/preference profile (CC/PP) to communicate client profiles. The forthcoming wireless access protocol (WAP) proposes the user agent profile (UAProf), which includes device profiles covering the hardware platform, software platform, network characteristics, and browser (Butler, 2001).
The MPEG-7 UserPreferences DS allows users to specify their preferences (likes and dislikes) for certain types of content and for ways of browsing (FDIS, Information Technology, 2001). To describe the types of desired content, the FilteringAndSearchPreferences DS is used, which consists of the creation of the content (CreationPreferences DS), the classification of the content (ClassificationPreferences DS), and the source of the content (SourcePreferences DS). To describe the ways of browsing the selected content requested by the user, the BrowsingPreferences DS is used along with the SummaryPreferences DS. Furthermore, each user preference component is associated with a preference value indicating its relative importance with respect to that of other components. For instance, a user preference description can encapsulate the preference ranking among several genre categories (ClassificationPreferences DS) that are produced in the United States in the last decade (CreationPreferences DS) in wide screen with Dolby AC-3 audio format (SourcePreferences DS). The user may also be interested in nonlinear navigation and access of the retrieved summarization content, specified by the preferred duration and preference ranking (SummaryPreferences DS).
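A description along these lines might be assembled as in the following sketch. The DS names come from the text above, but the nesting and attributes are simplified relative to the actual MPEG-7 schema, so treat this as illustrative rather than schema-valid output:

```python
import xml.etree.ElementTree as ET

def user_preferences(genres, summary_duration_s):
    """Build a simplified MPEG-7-style UserPreferences description:
    a ranked genre list under FilteringAndSearchPreferences and a
    preferred summary duration under BrowsingPreferences."""
    prefs = ET.Element("UserPreferences")
    fsp = ET.SubElement(prefs, "FilteringAndSearchPreferences")
    cls = ET.SubElement(fsp, "ClassificationPreferences")
    for rank, genre in enumerate(genres, start=1):
        # Earlier genres get a higher preference value.
        g = ET.SubElement(cls, "Genre",
                          preferenceValue=str(len(genres) - rank + 1))
        g.text = genre
    browse = ET.SubElement(prefs, "BrowsingPreferences")
    summ = ET.SubElement(browse, "SummaryPreferences")
    # ISO 8601 duration, e.g. PT120S for two minutes.
    ET.SubElement(summ, "SummaryDuration").text = f"PT{summary_duration_s}S"
    return ET.tostring(prefs, encoding="unicode")
```

Calling `user_preferences(["news", "sports"], 120)` yields a description ranking news above sports with a two-minute preferred summary duration.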
The MPEG-7 User Preferences Descriptions specifically declare the user's preferences for the filtering, search, and browsing of the requested multimedia content. However, other descriptions may be required to account for the terminal, network, delivery, and other environment parameters. The MPEG-21 Usage Environment Descriptions cover exactly these extended requirements, even though the specific requirements are still being defined and refined (Digital Item Adaptation, 2002b). The
descriptions on terminal capabilities include the device types, display characteristics,
output properties, hardware properties, software properties, and system configura-
tions. This allows the personalized content to be appropriately adapted to fit the ter-
minal. For instance, videos can be delivered to wireless devices in smaller image sizes
in a format that the device decoders can handle. The descriptions on the physical net-
work capability allow content to be dynamically adapted to the limitation of the net-
work. The descriptions on the delivery layer capabilities include the types of transport
protocols and connections. These descriptions allow users to access location-specific
applications and services. The MPEG-21 Usage Environment Descriptions also include User Characteristics, which describe service capabilities, interactions and relations among users, conditional usage environments, and dynamic updating.
2.3. Content adaptability
Content adaptability refers to the multiple variations into which a media item can be transformed, through changes in format, scale, rate, and/or quality. Format transcoding may be required to accommodate the user's terminal devices. For example, an MPEG movie cannot be rendered on the Palm III due to the lack of an MPEG decoder, and thus must be transcoded into bitmap images to be rendered as a slide show. Scale conversion can represent image resizing, video frame rate extrapolation, or audio channel enhancement. Rate control corresponds to the data rate for transferring the media content, and may allow variable or constant rates. Quality of service can be guaranteed to the user based on criteria including SNR or distortion quality measures. These adaptation operations transform the original content to efficiently fit the usage environment.

The MPEG-7 media resource requirement and the MPEG-21 media resource adapt-
ability both provide descriptions for allowing certain types of adaptations. These
adaptation descriptions contain information for a single adaptation or a set of adap-
tations. The descriptions may possibly include required conditions for the adaptation,
permissions for the adaptation, and configurations for the adaptation operation.
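A minimal sketch of how such a declaration might gate adaptation decisions follows; the dictionary fields are hypothetical stand-ins, not the standardized description syntax:

```python
def plan_adaptation(declaration, terminal):
    """Choose adaptation operations that the content's adaptability
    declaration permits and that the terminal requires.

    declaration: the content's format, width, and permitted adaptations.
    terminal:    the target device's format and maximum display width.
    """
    ops = []
    if terminal["format"] != declaration["format"]:
        if terminal["format"] not in declaration.get("transcodable_to", []):
            raise ValueError("declaration does not permit this transcoding")
        ops.append(("transcode", terminal["format"]))
    if terminal["max_width"] < declaration["width"]:
        if not declaration.get("scalable", False):
            raise ValueError("declaration does not permit scaling")
        ops.append(("scale", terminal["max_width"]))
    return ops
```

Mirroring the Palm III example above, an MPEG-1 source declared transcodable to bitmaps and scalable would yield a transcode operation followed by a downscale for a small bitmap-only display.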
3. System overview
Our Video Personalization and Summarization System comprises three major components: the user client, the database server, and the media middleware. Fig. 2 illustrates the block diagram of these components. The user client component allows the user to specify his or her preference query along with the usage environment, and receives the personalized content on the display client. The database server component stores all the content sources as well as their corresponding MPEG-7 media descriptions, MPEG-21 rights expressions, and content adaptability declarations. The media middleware represents the intermediate component of the figure; it processes the user query and the usage environment with the media descriptions and rights expressions to generate personalized content, which is appropriately adapted and transmitted to the user's display client.
In the client end of our system, a user can request personalized content by specifying a user query and communicating his/her usage environment to the media middleware. The user query takes the form of preference topics, certain keywords, and the user's time constraint for watching the content. The usage environment
Fig. 2. Block diagram of video personalization and summarization system. The three major components
are the user client, the database server, and the media middleware.
includes descriptions about the client terminal capabilities, physical network proper-
ties, and delivery layer characteristics. The display client can range from a pervasive
mobile device to a networked high-end workstation. The customized content is opti-
mally selected, summarized and adapted in the middleware and personally delivered
to the user.
The database server provides the media middleware with the content sources and their corresponding descriptions. Each content item is associated with a set of media descriptions, rights expressions, and adaptability declarations. In the database server, the content sources can include text, audio, image, video, and graphics, and are stored in various formats (e.g., MPEG-1/2/4, AVI, and QuickTime). The corresponding media descriptions include feature descriptions as well as semantic concepts. These semantic descriptions can be semi-automatically generated by our VideoAnnEx MPEG-7 annotation tool (Lin et al., 2003) or automatically generated by the VideoAL MPEG-7 automatic labeling tool (Lin et al., 2003). For each content item, there may be an associated set of rights expressions on the various usages of the media. These expressions define the right for others to view, copy, print, modify, or distribute the original content. Similarly, there is an associated set of content adaptability declarations on the possible variation transformations that can be applied to the media.
The media middleware interfaces the user client and the database server. The mid-
dleware consists of the personalization engine and the adaptation engine. The media
descriptions and rights expressions are used by the personalization engine to opti-
mally select the required sources for the personalized content. Afterwards, the se-lected content sources, stored in SMIL, ASX, or MPEG-4 XMT format, and
corresponding content adaptability declarations are sent to the adaptation engine
to generate the personalized content. In the personalization engine, the user query
and usage environment are matched with the media descriptions and rights expres-
sions to generate the personalized content. In the adaptation engine, the results of
the personalization engine are used to retrieve the corresponding content sources
and content adaptability declarations. Our adaptation engine, powered by the Vid-
eoSue summarization modules (Tseng et al., 2002; Tseng and Lin, 2002; Tseng
and Smith, 2003), determines the optimal variation (i.e., format, size, rate, and qual-
ity) of the content for the user in accordance with the adaptability declarations and
the inherent usage environment. Subsequently, the appropriate adaptation is per-
formed on the media content and delivered to the user's display client. Our adapta-
tion engine consists of multiple transcoders including our VideoEd editing and
composition tool (Lin et al., 2002) and Universal Tuner video transcoder (Han
et al., 2001).
4. Database server
The content sources are stored, described, analyzed and annotated with MPEG-7
and MPEG-21 descriptions. Fig. 2 illustrates the database server with four major
components. First, the database can store video sources in any format. Our media
middleware accepts various kinds of video file types, bit-rates, and resolutions. Also, each video is only required to be stored once, in any format, because the middleware can dynamically access and transcode the desired video segments in sequence and generate one composite video in real time. Even though each output video is a composition of multiple video segments, our database does not need to pre-segment the videos. This brings significant benefits to content and storage management.
Second, the database stores media descriptions in MPEG-7 format. Two tools can
be used for associating labels with content. We implemented the VideoAnnEx MPEG-7 annotation tool to assist users in annotating semantic descriptions semi-automatically. The VideoAnnEx is one of the first MPEG-7 annotation tools to be made publicly available. It offers a number of interesting capabilities, including automatic shot detection, key-frame selection, automatic label propagation to visually similar shots, and the importing, editing, and customizing of ontologies. The second tool is an automatic labeling tool based on pre-trained classifiers. The VideoAL accepts MPEG-1 sequence inputs and generates MPEG-7 XML metadata files based on previously established anchor models.
Third, the database maintains rights management on the usage of content sources as MPEG-21 Rights Expressions. Content owners can specify rights expressions for
an individual or group to view, copy, print, modify, and/or distribute the video. Fi-
nally, the database also stores content adaptability descriptions as MPEG-7 Media
Resource Requirement and MPEG-21 Media Resource Adaptability. Content
adaptability specifies requirements on the adaptability parameters and constraints
for changing the original content to a personalized derivative for the user. In this pa-
per, we shall focus on the details of the content adaptability functions of the system,
without further addressing the rights issues.
4.1. Video segmentation
In general, a short video clip can be annotated by simply describing its content in
its entirety. However when the video is longer, annotation of its content can benefit
from segmenting the video into smaller units. A video shot is defined as a continuous camera-captured segment of a scene, and is usually well defined for most video
content. Given the shot boundaries, the annotations are assigned for each video
shot.
Shot boundary detection is performed to divide the video into multiple non-over-
lapping shots. We use either the IBM CueVideo Toolkit or the built-in functionality
in VideoAnnEx to perform the shot boundary detection. CueVideo, which is based on multiple-timescale differencing of the color histogram, segments our video content into shorter shots, where scene cuts, dissolves, and fades are effectively detected (Amir et al., 2000). Our VideoAnnEx tool can also perform the video segmentation, based on color histogram distributions.
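A single-timescale version of histogram-based cut detection can be sketched as follows. CueVideo's multiple-timescale differencing and its dissolve/fade handling are more involved; this only illustrates the basic idea of flagging a boundary where consecutive frame histograms diverge:

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame (H x W x 3, uint8) into a normalized
    joint color histogram with bins**3 cells."""
    quantized = (frame.astype(np.int64) // (256 // bins)).reshape(-1, 3)
    # Map each pixel's (r, g, b) bin triple to one index in [0, bins**3).
    idx = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    """Mark a boundary wherever the L1 distance between consecutive
    frame histograms exceeds the threshold (a hard cut)."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]
```

On a synthetic sequence of five black frames followed by five white frames, the detector reports a single boundary at frame 5.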
4.2. Video content semantic lexicon
In general, a video shot can fundamentally be described by three attributes. The
first is the background surrounding of where the shot was captured by the camera,
which is referred to as the static scene. The second attribute is the collection of significant subjects involved in the shot sequence, which is referred to as the key object.
Lastly, the third attribute is the corresponding action taken by some of the key ob-
jects, which is referred to as the event. These three types of lexicon define the vocab-
ulary for our video content.
According to the characteristics of the video corpus, a pre-defined lexicon set can be
imported into our IBM VideoAnnEx. The shots are labeled with respect to the
selected lexicon. Users can also associate each label with regions in the key-frame of the shot. Note that the lexicon, whose format is compatible with MPEG-7, is dependent on the summarization application, and can be modified, imported, and
saved using the VideoAnnEx.
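In memory, one shot's labels under this three-attribute lexicon might look like the following sketch; VideoAnnEx itself stores such labels as MPEG-7 XML, and the field names here are illustrative:

```python
def annotate_shot(shot_id, static_scene, key_objects, events):
    """Record one shot's labels under the three lexicon attributes:
    static scene (background), key objects (significant subjects),
    and events (actions taken by key objects)."""
    return {
        "shot": shot_id,
        "static_scene": static_scene,      # e.g. "ice rink"
        "key_objects": list(key_objects),  # e.g. ["player", "goalie"]
        "events": list(events),            # e.g. ["goal scored"]
    }
```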
4.3. VideoAnnEx—MPEG-7 video annotation tool
We developed an MPEG-7 video annotation tool, VideoAnnEx, to assist authors in annotating video sequences with MPEG-7 metadata. Each shot in the video can be annotated according to selected lexicon sets. VideoAnnEx takes an MPEG video sequence as the required input source. It requires a corresponding shot segmentation
file, which can be loaded into the tool from other sources or generated when the in-
put video is first opened.
The VideoAnnEx annotation tool is divided into four graphical sections as illus-
trated in Fig. 3. The Video Playback window displays the video sequence. As the
video is played back in the display window, the current shot information is given
as well. The Shot Annotation module displays the defined semantic lexicons and
the key frame window. The Views Panel displays two different previews of representative images of the video. The Frames in the Shot view shows all the I-frames as representative images of the current video shot, while the Shots in the Video view (as in the bottom of Fig. 3) shows the key frames of shots over the entire video. As the annotator labels each shot, the descriptions are displayed below the corresponding key frames. A more detailed description of the annotation tool as well as its active learning components is given in Lin et al. (2003).

Fig. 3. The IBM VideoAnnEx MPEG-7 video annotation tool is divided into four regions: (1) video playback, (2) shot annotation, (3) views panel, and (4) region annotation (not shown).
Labels can be associated with either the entire video shot or the regions on the key-frames of video shots. A video shot is defined as a continuous camera-captured segment of a scene. Shot boundary detection is performed to divide the video into multiple non-overlapping shots. The VideoAnnEx performs shot boundary segmentation in the compressed domain, based on color histogram distributions.
In general, a video shot can fundamentally be described by three attributes. The
first is the surrounding background, which is referred to as the static scene. The
second attribute is the collection of significant subjects, which is referred to as the
key object. The third attribute is the corresponding action taken by some of the key objects, which is referred to as the event. According to the characteristics of the
video corpus, a pre-defined lexicon set can be imported into the VideoAnnEx. Note
that the lexicon, whose format is compatible with MPEG-7, is dependent on the sum-
marization application, and can be modified, imported, and saved using the
VideoAnnEx.
In our Video Personalization and Summarization System, a relevance score is
automatically assigned to the video based on the confidence value of the classifica-
tion. For our system, the annotation process generates a relevance score for the whole video sequence and for each attribute, based on the probability of that attribute for the corresponding video unit. After these steps, users can manually correct
the annotation as well as the scene boundaries. All these results are then saved as
an MPEG-7 XML file.
4.4. VideoAL—MPEG-7 video automatic labeling tool
Seven modules were included in the whole system: Shot Segmentation, Region Segmentation, Annotation, Feature Extraction, Model Learning, Classification, and MPEG-7 XML Rendering. To learn concept models, in the first step, shot boundary detection is performed on the training video set. Semantic labels are then associated with each shot or region using the VideoAnnEx module. A feature extraction mod-
ule extracts visual features from shots in different spatial-temporal granularity. Fi-
nally, a concept-learning module builds models for anchor concepts, e.g.,
outdoors, indoors, sky, snow, car, flag, etc. (see Fig. 4).
The first three modules of the automatic semantic labeling process are the same as
those of the training process. After features are extracted, a classification module
tests the relevance of shots with the anchor concept models and results in a confidence value for each concept. These output concept values are then described using
the MPEG-7 XML format. Only high confidence-value concepts are used to describe
the content. This process is executed automatically from MPEG-1 video stream to
MPEG-7 XML output. A relevance score is automatically assigned to the video shot
based on the confidence value of classification.
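The confidence-thresholding and XML rendering steps can be sketched as follows. The element names are simplified stand-ins for the MPEG-7 schema, and the per-concept scores are assumed to come from the pre-trained classifiers:

```python
import xml.etree.ElementTree as ET

def label_shot(shot_id, concept_scores, threshold=0.5):
    """Keep only the concepts whose classifier confidence meets the
    threshold, and render them as a simplified MPEG-7-style fragment."""
    kept = {c: s for c, s in concept_scores.items() if s >= threshold}
    shot = ET.Element("VideoSegment", id=shot_id)
    for concept, score in sorted(kept.items()):
        ann = ET.SubElement(shot, "KeywordAnnotation")
        ET.SubElement(ann, "Keyword").text = concept
        ET.SubElement(ann, "Relevance").text = f"{score:.2f}"
    return ET.tostring(shot, encoding="unicode")
```

For instance, a shot scored {outdoors: 0.9, sky: 0.7, snow: 0.2} would be described only by the high-confidence concepts outdoors and sky.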
The system extracts two sets of visual features for each video shot. These two sets are applied by two different modeling procedures. The first set includes: (1) color histogram (YCbCr, 8×3×3 dimensions), (2) auto-correlograms (YCbCr, 8×3×3 dimensions), (3) edge orientation histogram (32 dimensions), (4) Dudani's moment invariants (6 dimensions), (5) normalized width and height of the bounding box (2 dimensions), and (6) co-occurrence texture (48 dimensions). These visual features are all extracted from the keyframe. The other set of visual features includes: (1) color histogram (RGB, 8×8×8 dimensions), (2) color moments (7 dimensions), (3) coarseness (1 dimension), (4) contrast (1 dimension), (5) directionality (1 dimension), and (6) motion vector histogram (6 dimensions). The first five features are extracted from the keyframe. The motion features are extracted from every I and P frame of the shot using the motion estimation method described in the region segmentation module. The VideoAL system uses support vector machines (SVM) as the classification method and ensemble fusion for classification. A detailed description of VideoAL is given in Lin et al. (2003).

Fig. 4. System overview of VideoAL: (A) concept training process and model-building modules; (B) detection process and labeling modules.
5. Media middleware
The media middleware consists of the personalization engine and the adaptation
engine, as illustrated in the system overview of Fig. 2. In the personalization engine, the user query and usage environment are matched with the media descriptions and
rights expressions to generate the personalized content. The adaptation engine deter-
mines the optimal variation (i.e., format, size, rate, and quality) of the content for the
user in accordance with the adaptability declarations and the inherent usage
environment.
5.1. Personalization engine
The objective of our system is to show a shortened video summary that maintains as much semantic content as possible within the desired time constraint. The VideoSue engine performs this process by three different matching methods, described next.
5.1.1. VideoSue—video summarization on usage environment
VideoSue stands for Video Summarization on Usage Environment. This engine takes MPEG-7 metadata descriptions from our content sources, along with the MPEG-7/MPEG-21 user preference declarations and the user's time constraint, and outputs an optimized set of selected video segments, which generates the desired personalized video summary (Tseng et al., 2002), as illustrated in Fig. 5. Using shot segments as the basic video unit, there are multiple methods of video summarization based on spatial and temporal compression of the original video sequence. In our work, we focus on the insertion or deletion of each video shot depending on user preference.
Each video shot is either included or excluded from the final video summary.
In each shot, MPEG-7 metadata describes the semantic content and the corresponding scores. Assume there are a total of $N$ attribute categories. Let $\vec{P} = [p_1, p_2, \ldots, p_N]^T$ be the user preference vector, where $p_i$ denotes the preference weighting for attribute $i$, $1 \le i \le N$. Assume there are a total of $M$ shots. Let $\vec{S} = [s_1, s_2, \ldots, s_M]^T$ be the shot segments that comprise the original video sequence, where $s_i$ denotes shot number $i$, $1 \le i \le M$. Subsequently, the attribute score $a_{i,j}$ is defined as the relevance of attribute $j$ in shot $i$, $1 \le i \le M$ and $1 \le j \le N$. It then follows that the weighted importance $w_i$ of shot $i$ given the user preference $\vec{P}$ is calculated as the product of the attribute matrix $A$ and the user preference vector $\vec{P}$:
Fig. 5. Overview of video summarization engine. Each video is segmented into shots that are annotated with MPEG-7 descriptions. Using the user's preferences and profiles, the optimally matched set of video shots is selected.
$$\vec{W} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & \cdots & \cdots & \cdots \\ \cdots & \cdots & a_{i,j} & \cdots \\ a_{M,1} & \cdots & \cdots & a_{M,N} \end{bmatrix} \cdot \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_N \end{bmatrix},$$
where $w_i$ specifies the relative weighted importance of shot $i$ with respect to the other shots. Assume shot $s_i$ spans a duration $t_i$, $1 \le i \le M$. Consequently, shot $s_i$ is included in the summary if its importance weighting exceeds some threshold, $w_i > \theta$, and excluded otherwise. The threshold $\theta$ is chosen such that the sum of the selected shot durations $t_i$ is less than the user-specified time constraint. In other words, each shot is ranked according to its weighted importance and either included in or excluded from the final personalized video summary according to the time constraint. Furthermore, the VideoSue engine generates the optimal set of selected video shots for a single video as well as across multiple video sources.
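The selection just described can be sketched in a few lines. The function name and the greedy strategy, which simply takes shots in decreasing w_i order until the time budget is exhausted, are our own simplifications of the threshold search:

```python
import numpy as np

def summarize(A, p, durations, time_limit):
    """Rank shots by weighted importance w = A @ p, then greedily keep the
    highest-ranked shots whose durations still fit the time constraint."""
    w = A @ p                       # weighted importance of each shot
    order = np.argsort(-w)          # shots ranked by descending importance
    selected, total = [], 0.0
    for i in order:
        if total + durations[i] <= time_limit:
            selected.append(int(i))
            total += durations[i]
    return sorted(selected)         # restore temporal order for playback
```

Taking shots strictly by rank until the budget is filled approximates choosing the threshold θ described above.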
5.1.2. Video summarization using visual semantic annotations and automatic speech
transcriptions
The objective of our personalization and summarization is to show a shortened video that captures the most relevant content within the desired time constraint. We describe another summarization algorithm that uses visual annotations and speech transcripts to define the summarization segment units and to match user preferences, as presented in Tseng and Lin (2002). Metadata describing our content are available as annotated video shots from the annotation tools and as recognized speech words from automatic speech recognition systems. Our video summary objective is to retrieve video shots with their corresponding coherent sentence segments.
Visual shot boundaries do not necessarily correspond to the semantic boundaries imposed by an audio sentence. Thus audio-visual alignment, abbreviated as AVA, is performed to determine the optimal shot-to-sentence aligned boundaries. The resulting AVA unit captures the shot-to-sentence relationships. For instance, many short shots may be required to convey one complete concept delivered by a long sentence. Another example is when one video shot is too long and needs to be further segmented.
Fig. 7 shows the three shot-to-sentence relationships, where V stands for video
and A for audio. For case (1) Short Shots and Long Sentence, there are many short
shots to convey one complete concept delivered by the long sentence. Thus, the sum-
mary AVA segment selects the span of the long sentence. For case (2) Long Shot and
Short Sentences, the shot is too long and can be further segmented. Consequently,
the summary AVA segment selects only the required span of one sentence. For case
(3) Similar Shot and Sentence Lengths, the summary AVA unit is clearly defined as the continuous camera-captured shot.
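The three shot-to-sentence cases can be folded into one alignment rule: take the span of the sentence, unless the overlapping shot span and the sentence are of similar length, in which case take the shot. The sketch below is our own simplified reading of AVA, with a hypothetical similarity threshold of 0.8:

```python
def ava_units(shots, sentences, similar=0.8):
    """Derive audio-visual aligned (AVA) units from shot and sentence
    intervals, each given as a (start, end) pair in seconds."""
    units = []
    for sent_start, sent_end in sentences:
        # shots that overlap this sentence in time
        overlapping = [(s, e) for s, e in shots if s < sent_end and e > sent_start]
        if not overlapping:
            continue
        shot_start = min(s for s, _ in overlapping)
        shot_end = max(e for _, e in overlapping)
        ratio = min(shot_end - shot_start, sent_end - sent_start) / \
                max(shot_end - shot_start, sent_end - sent_start)
        if ratio >= similar:
            units.append((shot_start, shot_end))   # case 3: similar lengths, keep the shot
        else:
            units.append((sent_start, sent_end))   # cases 1 and 2: keep the sentence span
    return units
```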
Our personalized video summary is composed of video shots and coherent sentence segments. Using the MPEG-7 descriptions from the shot-based video annotations and the aggregated sentence transcripts, the personalization engine matches these metadata against the user preferences and selects the optimal set of summarization AVA units. Subsequently, each AVA unit is excluded, included, or partially included in the final personalized video summary. The final selection depends on the ranked relevance of the user preference correlation against the AVA media descriptions, subject to the time constraint as presented earlier.
5.1.3. Video summarization using hierarchical context clustering
So far our video summaries are composed of a set of independent shots or AVA units that optimally match the user preferences while limited by the total time constraint. The resulting video summaries do not take the relative temporal segment selections into consideration. Furthermore, the selected summary segments should convey some coherence and continuity, especially for neighboring segments that may be part of a more complete story. As a result, selecting the optimal summary segments also depends on the relative context. In this section, we examine a new summarization algorithm in which a hierarchical context structure is imposed on the video (Tseng and Smith, 2003).
Well-structured video content is hierarchical in nature. This holds for most video sequences that have defined program structures and general production formats. For instance, news videos follow a program structure using categories such as international politics, local news, finance, and sports. The news categories are then broken up into specific news stories. Furthermore, the news stories can be decomposed into AVA units, and finally into individual camera shots, as illustrated in Fig. 6. Similarly, user interests can be specified at any of these granularities or levels. As a result, we base our video summarization algorithm on the hierarchical levels of the content metadata.
Starting with the basic shot segments of a video, let the value $v(s_i)$ associated with shot $s_i$ denote the derived user value of viewing shot $s_i$, and let $d(s_i)$ denote its duration. Determining the optimal set of shots $s_j$ for a video summary then corresponds to solving the following problem:
Fig. 6. Hierarchical decomposition of a video sequence into category segments, story segments, AVA
units, and shots.
Fig. 7. Video shot and audio sentence alignment example.
$$\text{maximize the sum of the shot values: } \sum_j v(s_j),$$

$$\text{subject to the sum of the shot durations: } \sum_j d(s_j) \le D.$$
As demonstrated by Merialdo et al. (1999), the above optimization is an instance of the Knapsack problem, for which a heuristic solution is derived using the value density:

$$\text{value density} = v(s_i)/d(s_i).$$

When shot values are similar, the value density assigns a higher score to shorter shots, which are thus more likely to be selected for the resulting video summary. The value density is a good approximation for solving the NP-complete Knapsack problem. However, it does not reflect coherence between neighboring shots or among shots belonging to the same story.
Let us define a hierarchy of video segment relationships, such as shots, stories, and categories. For illustration, multiple shots can belong to one story, and many stories are grouped into a category. Shots reside in level 1, stories in level 2, and categories in level 3. Let $s(i,l)$ denote the video segment at level $l \in [1, L]$ in which shot $s_i$ resides. It follows that $s(i, l{=}1)$ corresponds to shot $s_i$, $s(i, l{=}2)$ corresponds to the story covering shot $s_i$, and $s(i, l{=}3)$ corresponds to the category embracing shot $s_i$. Similarly, $d(i,l)$ denotes the duration of video segment $s(i,l)$. Subsequently, we can compute the relative hierarchical contribution as follows:
$$\text{coherence value for shot } s_i = \sum_{l=1}^{L} B(l) \sum_{j \in (i,l)} v(j),$$
where $B(l)$ is the coherence function dependent on the level $l$. In our case, we used a decaying level coherence $B(l) = (0.5)^{l-1}$. Thus the coherence value aggregates and spreads the value of the shots over the neighboring hierarchical levels. The coherence value reflects the shot relationships required for a more coherent video summary. To reflect the duration variable in the value score computation, we finally introduce the following:
$$\text{coherence density for shot } s_i = \sum_{l=1}^{L} B(l) \sum_{j \in (i,l)} v(j)/d(i,l).$$
The coherence density captures both the coherence relationship and the value density. Fig. 8 illustrates an example that demonstrates the ranked selection of shots depending on the value density, the coherence value, and the coherence density. In this example, there are 10 shots. Shots 1–8 have a duration of 1 min, $d(1\text{–}8) = 1$, and shots 9 and 10 have a duration of 1.5 min, $d(9) = d(10) = 1.5$. Shots 2, 5, 9, and 10 have a user value of 1.0, $v(2) = v(5) = v(9) = v(10) = 1$, and the other shots have a value of 0. In addition, shots 1 and 2 compose story I; shots 3, 4, and 5 compose story II; shots 7 and 8 compose story III; and shots 9 and 10 compose story IV. Furthermore, stories I and II form category I, and stories III and IV form category II. Fig. 8 shows these parameters in the top subplots.
In Fig. 8 (left), the value density is computed for each shot, and the scores are shown in the second subplot. The value density is then computed against the story durations, and these scores are aggregated with the shot value densities in the third subplot. Next, the value density is computed at the category level and aggregated with the story and shot densities to generate the fourth subplot. As a
Fig. 8. An example of aggregated relevance scores for hierarchical shots, stories, and categories, as they are computed for (left) value density, (middle) coherence value, and (right) coherence density.
result, the ranking of the segments gives the following order: shots 2, (5, 9, 10), 1, (3, 4), (7, 8), and 6. Thus the highly valued shot 2 is the first shot selected.

Similarly, in Fig. 8 (middle), the coherence value is computed and aggregated for shots (second subplot), stories (third subplot), and categories (fourth subplot). The resulting ranking generates the following segment order: shots (9, 10), (2, 5), (1, 3, 4), (7, 8), and 6. Subsequently, the highly coherent shots 9 and 10 are bumped to the top of the list.

Finally, in Fig. 8 (right), the coherence density is computed and aggregated for shots (second subplot), stories (third subplot), and categories (fourth subplot). The resulting ranking generates the following segment order: shots 2, (9, 10), 5, 1, (3, 4), (7, 8), and 6. Here, shot 2 is still the first shot selected, but shots 9 and 10 are ranked closely behind. In conclusion, the coherence density strikes an effective balance between the individual shot value contributions and the hierarchical coherence of neighboring shots.
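The coherence value and coherence density formulas can be computed directly once the segment membership at each level is known. The sketch below reconstructs the Fig. 8 example; the hierarchy membership of shot 6 is not stated in the text, so it is left outside the story and category levels, and the exact ordering among shots 5, 9, and 10 depends on how durations enter at each level:

```python
# Reconstruction of the Fig. 8 example: per-shot values and durations (min)
values = {i: 0.0 for i in range(1, 11)}
values.update({2: 1.0, 5: 1.0, 9: 1.0, 10: 1.0})
durations = {i: 1.0 for i in range(1, 9)}
durations.update({9: 1.5, 10: 1.5})

hierarchy = {                                 # level -> segments (groups of shots)
    1: [[i] for i in range(1, 11)],           # shots
    2: [[1, 2], [3, 4, 5], [7, 8], [9, 10]],  # stories I-IV
    3: [[1, 2, 3, 4, 5], [7, 8, 9, 10]],      # categories I-II
}

def B(l):
    return 0.5 ** (l - 1)                     # decaying level coherence

def segment_of(shot, level):
    """Return the level-`level` segment containing `shot`, or None."""
    return next((g for g in hierarchy[level] if shot in g), None)

def coherence_value(shot):
    total = 0.0
    for l in hierarchy:
        seg = segment_of(shot, l)
        if seg:
            total += B(l) * sum(values[j] for j in seg)
    return total

def coherence_density(shot):
    total = 0.0
    for l in hierarchy:
        seg = segment_of(shot, l)
        if seg:
            total += B(l) * sum(values[j] for j in seg) / sum(durations[j] for j in seg)
    return total
```

With this reconstruction, shots 9 and 10 top the coherence-value ordering while shot 2 tops the coherence-density ordering, matching the headline selections described above.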
5.2. Adaptation engine
The objective of the adaptation engine is to perform the optimal set of transformations on the selected content in accordance with the adaptability declarations and the inherent usage environment. The adaptation engine must be equipped to perform transcoding, filtering, and scaling such that the user can play the final adapted personalized content. We have implemented two adaptation engines, called VideoEd and the Universal Tuner.
5.2.1. VideoEd—real-time video editing and composition tool for content adaptation
VideoEd is a novel MPEG system-layer compressed-domain editing and composition technique used to facilitate the delivery and integration of multiple segments of MPEG files residing in remote databases (Lin et al., 2002). Various multimedia applications, including retrieval and summarization, split MPEG files into small segments along shot boundaries and store them separately. This traditional method requires extra management and storage overhead, provides only fixed segmentations, and may not play smoothly. To solve this problem, our adaptation engine directly extracts the desired video and audio information from the original MPEG sources and combines them to generate a single MPEG file. Manipulated wholly in the system bitstream domain, this method does not require decoding, re-encoding, or re-synchronization of the audio and video data. Thus, it operates in real time and provides great flexibility. The composite MPEG file can be transmitted and displayed through general Web interfaces.
Fig. 9 shows the flowchart of the editing tool, VideoEd. The tool takes three kinds of input: MPEG files, Frame Map files, and a Retrieval List. The first two are stored in a database. Frame Map files, which record the accessible pack-header positions of the GOPs, indicate the randomly accessible positions in an MPEG system stream. The Retrieval List is the XML file generated by the personalization engine.
The highlighted area in Fig. 9 shows the case of concatenating multiple video-audio segments from an MPEG file to generate a new one. The first step is to copy and
Fig. 9. Basic structure of the VideoEd compressed-domain editing and composition tool.
transfer the first one or two packs to a new file. The number of copied packs depends on whether the global information of the video and audio is stored separately and on the number of packets in a pack. Next, we need to extract the video and audio sequence headers. In an MPEG-1 bitstream, there can be more than one video sequence header (VSH). In most cases there is either one VSH for the whole video sequence or multiple VSHs, one before each GOP. Note that a VSH has to be placed immediately before a GOP. Audio sequence headers have the same properties as those of video.
The audio and video data are copied and pasted at the pack level. According to the Retrieval List, we then choose the accessible positions indicated by the Frame Map files. Note that the resolution of segmentation has to be at the GOP level. In other words, video and audio can only be cut and pasted at a resolution of roughly 0.5 s. After this step, we need to clean up the bits that belong to the previous or next GOP. If there are multiple packets in a pack, we delete the unnecessary packets. After cleaning up, we insert the video sequence header at the beginning of the first video segment. A similar cleaning process is performed for the audio packets.
The final step of VideoEd is the manipulation of the time codes, which must be done in the system pack and packet layers. The new sequence is regarded as valid regardless of the time codes at the GOP headers. To concatenate two clips, we calculate a temporal offset as the difference between the first SCR of the new clip and the last SCR of the previous clip, and subtract it from the SCRs of all the packs. We change the PTS and DTS of all the packets in the second clip using a similar strategy.
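A toy sketch of this time-code rebasing follows. Real MPEG-1 system streams encode SCR, PTS, and DTS as bit fields inside the pack and packet headers; the dict-based representation and field names here are purely illustrative:

```python
def rebase_timestamps(packs, prev_clip_end_scr):
    """Shift the SCR of every pack, and the PTS/DTS of every packet, in an
    appended clip so its clock continues from the previous clip's last SCR."""
    offset = packs[0]["scr"] - prev_clip_end_scr
    for pack in packs:
        pack["scr"] -= offset
        for pkt in pack.get("packets", []):
            for field in ("pts", "dts"):
                if pkt.get(field) is not None:
                    pkt[field] -= offset
    return packs
```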
5.2.2. Universal Tuner—format and scale transcoder for pervasive devices
The other component in our adaptation is the Universal Tuner, which serves as an adaptive transcoder for pervasive devices (Han et al., 2001). In the Universal Tuner, the complexity of the multimedia compression and decompression algorithms is adaptively partitioned between the encoder and the decoder. A mobile client selectively disables or re-enables stages of the algorithm to adapt to the device's effective processing capability. This variable-complexity strategy of selectively disabling modules supports graceful degradation of the complexity of multimedia coding and decoding when a mobile client enters its low-power mode, i.e., when the clock frequency of its next-generation low-power CPU has been scaled down to conserve power. The input to the transcoder is a stream of video frames from an MPEG-1/2/4 file. The video frames are first decoded and resampled to either an 80 × 80 or a 160 × 160 image. Then, depending on whether the PDA device is black-and-white or supports 256 colors (as in the Palm IIIc), the RGB frames are either dithered using halftoning or mapped to 256 colors using a shot-based optimal color map. In the color mode, we added a shot boundary detection function in order to update the color codebook for each new shot. After this step, the transcoder can selectively enable an entropy coding process and a frame differencing process to further reduce the bit rate.
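The stage-selection logic can be sketched as a per-client adaptation plan. The device fields, thresholds, and option names below are illustrative assumptions, not the published Universal Tuner configuration:

```python
def adaptation_plan(device):
    """Decide which Universal Tuner stages to enable for a given client."""
    headroom = device.get("cpu_headroom", 0.0)   # fraction of CPU still available
    return {
        # drop to the small frame size when the CPU is in low-power mode
        "resolution": (80, 80) if device.get("low_power_mode") else (160, 160),
        # mono PDAs get halftone dithering; color PDAs a shot-based 256-color map
        "color": "halftone" if device["display"] == "mono" else "shot_palette_256",
        # optional stages that can be disabled to cut decoding complexity
        "entropy_coding": headroom > 0.5,
        "frame_differencing": headroom > 0.25,
    }
```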
6. Personalization applications and user clients
The user client allows the user to retrieve specific content via a user query, provide critical persistent data through the usage environment, and receive the personalized content on the user's display client. The usage environment holds profiles about the user, device, network, delivery, and other environments. These usage environment descriptions are sent to the personalization engine of the media middleware, and may include a user profile (e.g., an English-speaking American residing in New York City), a device profile (e.g., a Palm IIIc with a limited color display, no audio capabilities, and limited memory capacity), and a transmission profile (e.g., Internet access through a 56K modem). They allow the middleware to dynamically perform the appropriate content adaptation on the requested media. Thus the delivered multimedia can be personalized for the usage environment.
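For illustration, such a description might be serialized roughly as follows. This is a simplified sketch in the spirit of MPEG-21 Digital Item Adaptation; the element and attribute names are our own and are not schema-exact:

```xml
<!-- Simplified usage environment description sent from client to middleware -->
<UsageEnvironment>
  <User language="en" location="New York City"/>
  <Terminal model="Palm IIIc" display="256-color" audio="none" memoryKB="8192"/>
  <Network type="modem" maxBitrateKbps="56"/>
</UsageEnvironment>
```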
The user submits a request for content via a user query. The query can take the form of topic preferences, search keywords, media samples, and/or time constraints. The user query then initiates the retrieval of customized content by the media middleware. Subsequently, the personalized multimedia is delivered from the adaptation engine of the middleware to the user and rendered on the display client in accordance with the usage environment and the user query. If the query is in the form of free keywords, however, our current implementations can retrieve only the keywords that match the (manual or automatic) annotations. In the rest of this section, we present two personalization applications.
6.1. User clients on PCs using IBM Websphere Portal Server
The IBM Websphere Portal Server (WPS) product provides a secure, single point of access to diverse information and applications, personalized to the needs of its users. WPS uses portals to present users with personalized content according to authenticated user profiles. WPS can also be deployed on many categories of portals that allow access from a wide variety of desktops and mobile devices. This framework provides us with an open, flexible, and scalable infrastructure for creating and deploying our rich media personalization system with usage environment descriptions. Within WPS, we have developed the personalization and summarization portal.
When a user establishes an account with WPS, a user profile is created, which can be modified later. In our portal, the preference interests of the user are gathered, including the topic category and the default viewing time. Fig. 10 illustrates the user client portals for three different topic categories: News, Entertainment, and Education. The appropriate portal page according to the user profile is automatically presented when the user logs on to the portal; this illustrates the use of the persistent usage environment within WPS. Assume our user has stored the desired topic category as Entertainment. Then Fig. 11 (left) shows the corresponding portal interface of the Late Night Show, which is ranked in the Entertainment category.
Fig. 10. User client portals corresponding to three different usage environment profiles for: (1) news (left),
(2) entertainment (middle), and (3) education (right).
Fig. 11. The usage example and screen shots of the personalized video summarization for user clients on
PC.
After our user enters the portal page, she can request a personalized video summary according to her current interests. There are two preferences and a time limit for her selection. For the Late Night Show, the preference choices range from ''David_Letterman'' and ''Paul_Shaffer'' to ''Monologue,'' ''Guest_Interview,'' and ''Top_10.'' Our user submits the query preferences (1) ''Guest_Introduction'' and (2) ''David_Letterman,'' along with a time constraint of 120 s. As a result, the VideoSue summarization and personalization engine retrieves the optimal set of highly relevant video segments, which include David Letterman introducing many guests from three video sources. Subsequently, the VideoEd composition tool gathers this set of selected video clips and dynamically generates one personalized video movie. Fig. 11 (right) depicts four screen shots of guests as they are introduced.
6.2. User clients on personal digital assistant (PDA) devices
Using the Universal Tuner adaptation engine, we can transcode a summarized video and transmit it to pervasive devices. The client can access personalized video in one of three modes: (a) video-on-demand with interactive hyperlinks, (b) summarized video based on preference topics and a time constraint, and (c) summarized video based on query keywords and a time constraint. Fig. 12 illustrates the user client in mode (a) on the left, mode (b) in the middle, and mode (c) on the right. A user can first click on the Get Channels button to get a list of hyperlinks according to the stored user profile. This list is then displayed to the left of the video playback window. He can then select a video to view by clicking on the hyperlinks. In another scenario, he may want to see a summarized video based on the preference choices that were sent from the middleware when he selected a video. Then, based on his preferences and time constraint, he receives the summarized video that best matches his preferences and profile. In addition to the preferences recommended by the middleware, users can also key in keywords and let the middleware find the best matches and generate the personalized video summary.
Fig. 12. The user client interface on the palm OS PDA for three personalized video scenarios: (A) video-
on-demand with interactive hyperlinks, (B) summarized video based on preference topics and time
constraint, and (C) summarized video based on query keywords and time constraint.
7. Conclusion
A video personalization and summarization system is presented that matches media descriptions with usage environments in a server–middleware–client framework in order to deliver personalized media content to users. The three-tier architecture provides a standards-compliant infrastructure using our tools and engines to select, adapt, and deliver personalized video summaries to users effectively. The VideoAnnEx MPEG-7 Video Annotation Tool and the VideoAL Automatic Labeling Tool allow annotators to describe video segments using semantic concepts. The VideoSue Summarization on Usage Environment engine determines the optimal selection of media contents according to the user preferences and time constraint. The VideoEd Editing and Composition Tool and the Universal Tuner gather the selected contents and generate one personalized video for the user. Our personalization and summarization systems are demonstrated on top of the IBM Websphere Portal Server (WPS) and for PDA devices. In the future, we will add functionality that allows users to input any query keywords and let VideoSue match them with the existing annotations according to their semantic distance. We will also test the performance of other possible fusion methods, in addition to the weighting method that is currently used. Furthermore, we will work on increasing the coverage and accuracy of the concept models in VideoAL, and conduct more experiments on subjective user satisfaction.
References
Available from: <http://www.ibm.com>.
Available from: <http://www.nttdocomo.co.jp>.
Available from: <http://www.virage.com>.
Aizawa, K., Shijima, K.I., Shiina, M., 2001. Summarizing wearable video. IEEE ICIP, Thessaloniki,
Greece, October 2001.
Amir, A., Ponceleon, D., Blanchard, B., Petkovic, D., Srinivasan, S., Cohen, G., 2000. Using audio time
scale modification for video browsing. In: Hawaii International Conference on System Sciences,
HICSS-33, Maui, January.
Butler, M.H., 2001. Implementing content negotiation using CC/PP and WAP UAProf, Technical Report
HPL-2001-190, June 2001. Available from: <http://www.hpl.hp.com/techreports/2001/
HTTPCCPP_2001-190.html>.
Gong, Y., Liu, X., 2001. Summarizing video by minimizing visual content redundancies. IEEE ICME,
Tokyo, August 2001.
Han, R., Lin, C.-Y., Smith, J.R., Tseng, B.L., Ha, V., 2001. Universal tuner: a video streaming system for
CPU/power-constrained mobile devices. ACM Multimedia 2001, Ottawa, Canada.
ISO/IEC JTC 1/SC 29/WG 11/N 4242, Text of 15938-5 FDIS, Information Technology—Multimedia
Content Description Interface—Part 5 Multimedia Description Schemes, Final Document Interna-
tional Standard (FDIS) edition, October 2001.
ISO/IEC JTC1/SC29/WG11/M8321, MPEG-7 Tools for MPEG-21 Digital Item Adaptation, Fairfax, VA,
May 2002.
ISO/IEC JTC1/SC29/WG11/N5354, MPEG-21 Digital Item Adaptation, Awaji Island, Japan, December
2002.
Lin, C.-Y., Tseng, B.L., Smith, J.R., 2002. Universal MPEG content access using compressed-domain system stream editing techniques. In: IEEE International Conference on Multimedia and Expo, Switzerland, August.
Lin, C.-Y., Tseng, B.L., Smith, J.R., 2003. VideoAnnEx: IBM MPEG-7 annotation tool. In: IEEE
International Conference on Multimedia and Expo, Baltimore.
Lin, C.-Y., Tseng, B.L., Naphade, M., Natsev, A., Smith, J.R., 2003. VideoAL: a novel end-to-end
MPEG-7 video automatic labeling system. In: IEEE International Conference on Image Processing,
Barcelona, September.
Ma, Y.F., Zhang, H.J., 2002. A model of motion attention for video skimming. International Conference
on Image Processing, Rochester, New York, September.
Manjunath, B.S., Salembier, P., Sikora, T., 2002. Introduction to MPEG-7: Multimedia Content Description Interface. Wiley, New York.
Merialdo, B., Lee, K.T., Luparello, D., Roudaire, J., 1999. Automatic construction of personalized TV
news programs. In: ACM Multimedia 1999, Orlando, FL, September.
Sundaram, H., Chang, S.F., 2000. Determining computable scenes in films and their structures using
audio-visual memory models. ACM Multimedia 2000, Los Angeles, CA.
Tseng, B.L., Lin, C.-Y., Smith, J.R., 2002. Video summarization and personalization for pervasive mobile
devices. In: SPIE Electronic Imaging 2002—Storage and Retrieval for Media Databases, San Jose,
January.
Tseng, B.L., Lin, C.-Y., 2002. Personalized video summary using visual semantic annotations and
automatic speech transcriptions. In: IEEE Multimedia Signal Processing MMSP 2002, St. Thomas,
December.
Tseng, B.L., Smith, J.R., 2003. Hierarchical video summarization based on context clustering. In:
International Symposium on ITCom 2003—Internet Multimedia Management Systems, Orlando, FL.
Yeung, M., Yeo, B., Wolf, W., Liu, B., 1995. Video browsing using clustering and scene transitions on compressed sequences. In: Proceedings of the SPIE on Multimedia Computing and Networking, vol. 2417.