
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC 1/SC 29/WG 11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC 1/SC 29/WG 11/M6155
July 2000, Beijing

Source: MDS
Title: MPEG-7 Multimedia Description Schemes XM (Version 3.1)
Status: Draft
Editors: Peter van Beek, Ana B. Benitez, Joerg Heuer, Jose Martinez, Philippe Salembier, John Smith, Toby Walker

Foreword:

The MPEG-7 standard, also known as the "Multimedia Content Description Interface", aims at providing standardized core technologies allowing the description of audiovisual data content in multimedia environments. This is a challenging task given the broad spectrum of requirements and targeted multimedia applications, and the large number of audiovisual features of importance in such contexts.

In order to achieve this broad goal, MPEG-7 will standardize:
Descriptors (D): representations of Features, that define the syntax and the semantics of each feature representation,
Description Schemes (DS), that specify the structure and semantics of the relationships between their components, which may be both Ds and DSs,
A Description Definition Language (DDL), to allow the creation of new DSs and, possibly, Ds, and to allow the extension and modification of existing DSs,
System tools, to support multiplexing of descriptions, synchronization issues, transmission mechanisms, file format, etc.

The standard will be subdivided into seven parts:
1. Systems: Architecture of the standard, tools that are needed to prepare MPEG-7 Descriptions for efficient transport and storage, and to allow synchronization between content and descriptions. Also tools related to managing and protecting intellectual property.
2. Description Definition Language: Language for defining new DSs and perhaps eventually also for new Ds, binary representation of DDL expressions.
3. Visual: Visual elements (Ds and DSs).
4. Audio: Audio elements (Ds and DSs).
5. Multimedia Description Schemes: Elements (Ds and DSs) that are generic, i.e. neither purely visual nor purely audio.
6. Reference Software: Software implementation of relevant parts of the MPEG-7 Standard.
7. Conformance: Guidelines and procedures for testing conformance of MPEG-7 implementations.

This document contains the elements of the Multimedia Description Schemes (MDSs) part of the standard that are currently under consideration (part 5). This document defines the MDS eXperimentation Model (XM). It addresses both normative and non-normative aspects. Once an element is included in the Working Draft (WD), its normative elements are moved from the MDS XM document to the MDS WD document, and only the non-normative elements associated with the D or DS remain in this document. To make reading easier, a set of hyperlinks between the MDS XM and the MDS WD documents has been created. These links assume that both documents are in the same directory with their official names: MDS XM file: M6155.doc; MDS WD file: M6156.doc.

Current MDS XM document: Version 3.1 M6155 (Beijing, July 2000)
Current MDS WD document: Version 3.1 M6156 (Beijing, July 2000)
The syntax of the descriptors and DSs is defined using the following DDL WD:
DDL WD document: Version 3.0 N3391 (Geneva, May 2000)

The syntax defined in this document assumes the following schema wrapper:


<schema xmlns="http://www.w3.org/1999/XMLSchema"
        xmlns:mds="http://www.example.com/???"
        targetNamespace="http://www.example.com/???"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">


Table of contents

List of Figures
1 Introduction
  1.1 Organization of the document
  1.2 Overview of MDSs
2 Normative references
3 Terms, definitions, symbols, abbreviated terms
4 Datatypes and basic structures
  4.1 Integer Datatypes
    4.1.1 Datatype Syntax
    4.1.2 Datatype Semantics
  4.2 Vectors and matrices
    4.2.1 Vector
    4.2.2 Matrix
    4.2.3 Diagonal Matrix
  4.3 Probability Datatypes
    4.3.1 Probability Datatype
    4.3.2 Confidence Datatype
    4.3.3 Probability Vector Datatype
    4.3.4 Probability Matrix Datatype
  4.4 Histograms
    4.4.1 Histogram D
  4.5 Quantizers
    4.5.1 Uniform Quantization D
5 Link to the media and Localization
  5.1 References to Ds and DSs
    5.1.1 Reference D
    5.1.2 ReferenceToSegment D
    5.1.3 ReferenceToProgram D
  5.2 Unique Identifier
    5.2.1 Uidentifier D
  5.3 Time elements
    5.3.1 TimePoint D
    5.3.2 Duration D
    5.3.3 IncrDuration D
    5.3.4 RelTimePoint D
    5.3.5 RelIncrTimePoint D
    5.3.6 Time DS
    5.3.7 MediaTimePoint D
    5.3.8 MediaDuration D
    5.3.9 MediaIncrDuration D
    5.3.10 MediaRelTimePoint D
    5.3.11 MediaRelIncrTimePoint D
    5.3.12 MediaTime DS
  5.4 Media locators
    5.4.1 MediaURL D
    5.4.2 MediaLocator DS
    5.4.3 VideoSegmentLocator DS
    5.4.4 ImageLocator DS
    5.4.5 AudioSegmentLocator DS
    5.4.6 SoundLocator DS
6 Basic elements
  6.1 Textual description
    6.1.1 Language attribute
    6.1.2 Language Datatype
    6.1.3 ControlledTerm D
    6.1.4 TextualDescription Datatype
    6.1.5 StructuredAnnotation DS
  6.2 Description of persons
    6.2.1 Person DS
    6.2.2 Person Name Datatype
    6.2.3 Individual DS
    6.2.4 PersonGroup DS
    6.2.5 Organization DS
  6.3 Description of places
    6.3.1 Place DS
  6.4 Description of importance or priority
    6.4.1 Weight DS
  6.5 Entity-relationship graph
    6.5.1 EntityRelationshipGraph DS
  6.6 Description of Time Series
    6.6.1 TemporalInterpolation D
  6.7 Scalable Series
    6.7.1 SeriesOfScalarType
    6.7.2 SOSScaledType
    6.7.3 SOSFixedScaleType
    6.7.4 SOSFreeScaleType
    6.7.5 SOSUnscaledType
    6.7.6 SeriesOfVectorType
    6.7.7 SOVScaledType
    6.7.8 SOVFixedScaleType
    6.7.9 SOVFreeScaleType
    6.7.10 SOVUnscaledType
    6.7.11 Additional Informative Text
7 Description of the media
    7.1.1 MediaIdentification DS
    7.1.2 MediaFormat DS
    7.1.3 MediaCoding DS
    7.1.4 MediaInstance DS
    7.1.5 MediaTranscodingHints DS
    7.1.6 MotionHint DS
    7.1.7 MediaProfile DS
    7.1.8 MediaInformation DS
8 Description of the content creation & production
    8.1.1 Creation DS
    8.1.2 Classification DS
    8.1.3 MediaReview DS
    8.1.4 RelatedMaterial DS
    8.1.5 Creation MetaInformation DS
9 Description of the content usage
    9.1.1 Rights DS
    9.1.2 UsageRecord DS
    9.1.3 Financial DS
    9.1.4 UsageMetaInformation DS
10 Description of the structural aspects of the content
  10.1 Segment
    10.1.1 Segment DS
    10.1.2 VideoSegment DS
    10.1.3 StillRegion DS
    10.1.4 MovingRegion DS
    10.1.5 VideoText DS
    10.1.6 AudioSegment DS
  10.2 Segment Features
    10.2.1 MediaTimeMask DS
    10.2.2 Mosaic DS
    10.2.3 MatchingHint DS
    10.2.4 PointOfView DS
  10.3 SegmentRelationship graph
    10.3.1 SegmentRelationshipGraph DS
    10.3.2 Type of relationships
11 Description of the conceptual aspects of the content
    11.1.1 Affective DS
12 Content navigation and access
  12.1 Summarization
    12.1.1 Summarization DS
    12.1.2 Summary DS
    12.1.3 HierarchicalSummary DS
    12.1.4 HighlightLevel DS
    12.1.5 HighlightSegment DS
    12.1.6 SequentialSummary DS
    12.1.7 FrameProperty DS
    12.1.8 SoundProperty DS
    12.1.9 TextProperty DS
  12.2 Partitions and decompositions
    12.2.1 View DS
    12.2.2 Space View DS
    12.2.3 Frequency View DS
    12.2.4 SpaceFrequency View DS
    12.2.5 Resolution View DS
    12.2.6 SpaceResolution View DS
    12.2.7 Filter DS
    12.2.8 1D/2D-Filter DS
    12.2.9 ViewSet DS
    12.2.10 SpaceTree DS
    12.2.11 FrequencyTree DS
    12.2.12 SpaceFrequencyGraph DS
    12.2.13 VideoViewGraph DS
    12.2.14 MultiResolutionPyramid DS
  12.3 Description of variation of the content
    12.3.1 Variation DS
13 Organization of the content
  13.1 Collections
    13.1.1 Collection Structure DS
  13.2 Models
    13.2.1 Model DS
  13.3 Probability Models
    13.3.1 ProbabilityModel DS
    13.3.2 Gaussian DS
  13.4 Analytic Models
    13.4.1 AnalyticModel DS
    13.4.2 Cluster DS
    13.4.3 Examples DS
    13.4.4 ProbabilityModelClass DS
  13.5 Classifiers
    13.5.1 Classifier DS
    13.5.2 ClusterSet DS
    13.5.3 ExamplesSet DS
    13.5.4 ProbabilityModelClassifier DS
14 User Interaction
  14.1 User Preferences
    14.1.1 UserPreference DS
    14.1.2 UserIdentifier DS
    14.1.3 PreferenceType DS
    14.1.4 UsagePreferences DS
    14.1.5 BrowsingPreferences DS
    14.1.6 SummaryPreferences DS
    14.1.7 FilteringAndSearchPreferences DS
    14.1.8 ClassificationPreferences DS
    14.1.9 CreationPreferences DS
    14.1.10 SourcePreferences DS
15 Bibliography
16 Annex 1: Schema definition


List of Figures

Figure 1: Overview of the MDSs
Figure 2: Real Data and Interpolation Functions
Figure 3: Sequential key points selection and interpolation calculation method: If the error is greater than the threshold, then a key point is selected and fixed
Figure 4: The Function Blocks of the Scene Change Detection Algorithm
Figure 5: Motion Vector Ratio In B and P Frames
Figure 6: Inverse Motion Compensation of DCT DC
Figure 7: Outline of segment tree creation
Figure 8: Example of Binary Partition Tree creation with a region merging algorithm
Figure 9: Examples of creation of Binary Partition Tree with color and motion homogeneity criteria
Figure 10: Example of partition tree creation with restriction imposed with object masks
Figure 11: Example of restructured tree
Figure 12: General Structure of AMOS
Figure 13: Object segmentation at starting frame
Figure 14: Automatic semantic object tracking
Figure 15: The video object query model
Figure 16: Separation of text foreground from background
Figure 17: Mosaic for MPEG-7 test sequence 'Clinton' over 183 frames, affine model
Figure 18: Mosaic of "Parliament" sequence constructed from 120 frames, equal to 30 frames/sec
Figure 19: Mosaic of "Parliament" sequence constructed from 12 frames, equal to 3 frames/sec
Figure 20: Examples of spatio-temporal relationship graphs
Figure 21: Nature documentary
Figure 22: Key frame of a video shot capturing a goal in a soccer game
Figure 23: Syntactic relationships among "Goal-sg" segment, "Kick-sg" segment, "Not-Catch-sg" segment, and "Goal-rg" region for Figure 22
Figure 24: Syntactic relationships among "Kick-sg" segment, "Not-Catch-sg" segment, "Goal-rg" region, "Forward-rg" region, "Ball-rg" region, and "Goalkeeper-rg" region for Figure 22
Figure 25: Nature documentary
Figure 26: An image with associated 2D-String (rabbit < cloud < car = airplane, rabbit = car < airplane < cloud)
Figure 27: Pairwise clustering for hierarchical key-frames summarization in Algorithm B. In this example, the compaction ratio is 3. First T1 is adjusted in (a) considering only the two consecutive partitions at either side of T1. Then T2 and T3 are adjusted as depicted in (b) and (c), respectively
Figure 28: Shot boundary detection and key-frame selection in Algorithm C
Figure 29: Example tracking result (frame numbers 620, 621, 625). Note that many feature points disappear during the dissolve, while new feature points appear
Figure 30: Activity change (top). Segmented signal (bottom)
Figure 31: An example of a key-frame hierarchy
Figure 32: An example of the key-frame selection algorithm based on fidelity values
Figure 33: Illustration of smart quick view
Figure 34: Synthesizing frames in a video skim from multiple regions-of-interest
Figure 35: Aerial image (a) source: Aerial image LB_120.tif, and (b) a part of image (a) based on a spatial view DS
Figure 36: Frequency View of an Aerial image – spatial-frequency subband
Figure 37: Example SpaceFrequency view of Figure 35 using a high resolution for the region of interest and a reduced resolution for the context
Figure 38: Example view of image with reduced resolution
Figure 39: Aerial image (a) source: Aerial image LB_120.tif, and (b) a part of image (a) based on a spatial view DS
Figure 40: Example View Set with a set of Frequency Views that are image subbands. This View Set is complete and nonredundant
Figure 41: Example of Space and Frequency Graph for 2-D images. The graph organizes views of images in space and frequency such as low-resolution views, high-resolution spatial views, low-resolution spatial views, and frequency views
Figure 42: Example of Video View Graph. (a) Basic spatial- and temporal-frequency decomposition building block, (b) Example video view graph of depth three in spatial- and temporal-frequency
Figure 43: Selection screen in which the user specifies the terminal device and network characteristics in terms of screen size, screen color, supported frame rate, bandwidth and supported modalities (image, video, audio)
Figure 44: Resulting selection of variations of a video news program under different terminal and network conditions. The high-rate color variation is selected for capable terminals, whereas the low-resolution grayscale variation is selected for more constrained terminals
Figure 45: A generic usage model for user preference and media descriptions
Figure 46: Personalized filtering, search and browsing of audio-visual content


1 Introduction

1.1 Organization of the document
This document describes the MDS elements under consideration for part 5 of the MPEG-7 standard. In the sequel, each element is described by several sections:
Syntax: Normative DDL specification of the Ds or DSs.
Semantics: Normative definition of the semantics of all the components.
Extraction: Examples of strategies, such as extraction, analysis and indexing algorithms, that may be used to instantiate the corresponding element. This description is non-normative.
Example: Example of instantiation of the Ds or DSs.
Use: Examples of search, browsing, retrieval and filtering tools related to the corresponding element. This description is non-normative.

1.2 Overview of MDSs
The elements, Ds and DSs, described in this document are mainly structured on the basis of the functionality they provide. An overview of the structure is shown in Figure 1.

Figure 1: Overview of the MDSs

At the lowest level, basic elements can be found. They deal with basic datatypes, mathematical structures, linking and media localization tools, as well as basic DSs that are found as elementary components of more complex DSs. Based on this lower level, content management & description elements can be defined. These elements describe the content from several viewpoints. Currently five viewpoints are defined: Creation & Production, Media, Usage, Structural aspects and Conceptual aspects. The first three primarily address information related to the management of the content (content management), whereas the last two are mainly devoted to the description of perceivable information (content description). The following table defines more precisely the functionality of each set of elements:

Creation & Production: Meta-information describing the creation and production of the content. Typical features include title, creator, classification, purpose of the creation, etc. This information is most of the time author-generated, since it cannot be extracted from the content.

Usage: Meta-information related to the usage of the content. Typical features involve rights holders, access rights, publication, and financial information. This information may very likely be subject to change during the lifetime of the AV content.

Media: Description of the storage media. Typical features include the storage format, the encoding of the AV content, and elements for the identification of the media. Note that several instances of storage media for the same AV content can be described.

Structural aspects: Description of the AV content from the viewpoint of its structure. The description is structured around segments that represent physical spatial, temporal or spatio-temporal components of the AV content. Each segment may be described by signal-based features (color, texture, shape, motion, audio features) and some elementary semantic information.

Conceptual aspects: Description of the AV content from the viewpoint of its conceptual notions. (Note that currently this part of the MDS is still under Core Experiment and no elements are included in the XM or WD.)

The five sets of elements are presented here as separate entities. As will be seen in the sequel, they are interrelated and may be partially included in each other. For example, Media, Usage or Creation & Production elements can be attached to individual segments involved in the structural description of the content. Depending on the application, some areas of the content description will have to be emphasized while others may be minimized or discarded.

Besides the direct description of the content provided by the five sets of elements described in the previous table, tools are also defined for navigation and access: browsing is supported by the summary elements, and information about possible variations of the content is also given. Variations of the AV content can replace the original, if necessary, to adapt different multimedia presentations to the capabilities of the client terminals, network conditions or user preferences. Another set of tools (Content organization) addresses the organization of the content by classification, by the definition of collections and by modeling. Finally, the last set of tools, specified in User Interaction, describes the user's preferences pertaining to the consumption of multimedia material.


2 Normative references

The following ITU-T Recommendations and International Standards contain provisions which, through reference in this text, constitute provisions of ISO/IEC 15938. At the time of publication, the editions indicated were valid. All Recommendations and Standards are subject to revision, and parties to agreements based on ISO/IEC 15938 are encouraged to investigate the possibility of applying the most recent editions of the standards indicated below. Members of ISO and IEC maintain registers of currently valid International Standards. The Telecommunication Standardization Bureau maintains a list of currently valid ITU-T Recommendations.

ISO 8601: Data elements and interchange formats -- Information interchange -- Representation of dates and times.
ISO 639: Code for the representation of names of languages.
ISO 3166-1: Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes.
ISO 3166-2: Codes for the representation of names of countries and their subdivisions -- Part 2: Country subdivision code.


3 Terms, definitions, symbols, abbreviated terms
For the purposes of this International Standard, the following terms and definitions apply:

AV: Audio-Visual
CIF: Common Intermediate Format
D: Descriptor
DCT: Discrete Cosine Transform
DDL: Description Definition Language
DS: Description Scheme
IANA: Internet Assigned Numbers Authority
JPEG: Joint Photographic Experts Group
MDS: Multimedia Description Scheme
MPEG: Moving Picture Experts Group
MP3: MPEG-1/2 Layer 3 (audio coding)
QCIF: Quarter Common Intermediate Format
SMPTE: Society of Motion Picture and Television Engineers
TZD: Time Zone Designator
URI: Uniform Resource Identifier (IETF standard is RFC 2396)
URL: Uniform Resource Locator (IETF standard is RFC 2396)
XML: Extensible Markup Language


4 Datatypes and basic structures

4.1 Integer Datatypes
These datatypes constrain an integer value to lie within a range that can be represented within a fixed number of bits.

4.1.1 Datatype Syntax
The normative components associated with this element are specified in the Working Draft associated with the current version of this document (see Introduction).

4.1.2 Datatype Semantics
The normative components associated with this element are specified in the Working Draft associated with the current version of this document (see Introduction).

4.2 Vectors and matrices
These datatypes represent arbitrary-sized vectors or matrices of integer or real numbers.

4.2.1 Vector
A vector datatype represents a one-dimensional array of either integer or real values.

4.2.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.1.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.1.3 Description Extraction
Not applicable.

4.2.1.4 Description Example

If a "MyVector1" element is declared as:

<element name="MyVector1" type="IntegerVector"/>

an instance of this element can be written as follows:

<MyVector1 Size="5">1 20 300 4000 50000</MyVector1>

Note that in this declaration the range of the vector elements is unbounded. To restrict the range, for example to values from 0 to 10, the element can be declared as follows:

<element name="MyVector2"> <complexType base="IntegerVector" derivedBy="restriction"> <minInclusive value="0"/> <maxInclusive value="10"/> </complexType></element>

Furthermore, while the size of the vector is unconstrained in the original definition (i.e. vectors can be of arbitrary length), it can be constrained using the length, minLength and maxLength facets. For example, the following vector definition constrains the vector to be of length five.

<element name="MyVector3"> <complexType base="IntegerVector" derivedBy="restriction">

Page 14: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<Length value="5"/> </complexType></element>

Or, the same vector can be constrained to have a length between five and ten elements:

<element name="MyVector3"> <complexType base="IntegerVector" derivedBy="restriction">

<minLength value="5"/> <maxLength value="10"/>

</complexType></element>

Note that the Size attribute is not mandatory. If this attribute is not present, the number of elements appearing in the vector determines the size. The following shows an example of a null (zero-length) integer vector.

<MyIntegerVector/>

4.2.1.5 Description Use
The vector descriptors can be used whenever there is a need to represent arbitrary-sized vectors of numbers. For example, the Vector descriptor is used in the Model DS to represent the value of a description as a point in an n-dimensional description space. Many visual and audio descriptors use these types to represent points in a feature space.
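As an informative sketch, a hypothetical "MyFeaturePoint" element could carry such a point in a 3-dimensional feature space. The element name, and the FloatVector type (assumed here by analogy with IntegerVector and FloatMatrix), are illustrative only and are not defined by this document:

<element name="MyFeaturePoint" type="FloatVector"/>

<!-- A point in a 3-dimensional feature space -->
<MyFeaturePoint Size="3">0.25 0.50 0.75</MyFeaturePoint>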

4.2.2 Matrix
A matrix represents a two-dimensional array made up of rows and columns of real or integer numbers.

4.2.2.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.2.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.2.3 Description Extraction
Not applicable.

4.2.2.4 Description Example
Consider the following 4x4 matrix:

  1.0  0.0  0.0  0.0
  0.0  2.0  0.0  0.0
  0.0 -1.0  0.0  0.0
  0.0  0.0  0.0  1.2

When "MyMatrix1" element is declared as:

<element name="MyMatrix1" type="FloatMatrix"/>

this gives an instance such as:

<MyMatrix1 Size="4 4">
  1.0  0.0  0.0  0.0
  0.0  2.0  0.0  0.0
  0.0 -1.0  0.0  0.0
  0.0  0.0  0.0  1.2
</MyMatrix1>


Note that in this declaration the range of the matrix elements is unbounded. To restrict the range to lie within, for example, the interval [0.0, 1.0], it can be constrained as follows:

<element name="MyMatrix2"> <complexType base="FloatMatrix" derivedBy="restriction"> <minInclusive value="0.0"/> <maxInclusive value="1.0"/> </complexType></element>

Note also that in this declaration the dimensions of the matrix are not constrained. To declare "MyMatrix3" as an element restricted to two-dimensional matrices, it can be declared as:

<element name="MyMatrix3"> <complexType base="FloatMatrix" derivedBy="restriction"> <attribute name="Size"> <simpleType base="nonNegativeInteger" derivedBy="list"> <length value="2"/> </simpleType> </attribute> </complexType></element>

Furthermore, while the size of the matrix is not restricted in the original definition, it can be set using the "fixed" constraint in "MyMatrix4" as:

<element name="MyMatrix4"> <complexType base="FloatMatrix" derivedBy="restriction"> <attribute name="Size" use="fixed" value="3 2"/> </complexType></element>

resulting in a matrix such as:

<MyMatrix4>
  0.1 0.2 0.3
  0.4 0.5 0.6
</MyMatrix4>

Editor's Note: The previous example is invalid if the Size attribute is mandatory.

4.2.2.5 Description Use
The matrix descriptors can be used whenever there is a need to represent arbitrary-sized matrices of numbers. For example, a matrix descriptor can be used to represent a transformation between two vector spaces.
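As a hypothetical sketch (the element name "MyTransform" is not defined by this document), a 2x2 matrix describing a mapping between two two-dimensional vector spaces, here a 90-degree rotation, could be written as:

<element name="MyTransform" type="FloatMatrix"/>

<MyTransform Size="2 2">
  0.0 -1.0
  1.0 0.0
</MyTransform>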

4.2.3 Diagonal Matrix
A diagonal matrix is a square matrix whose non-diagonal elements are all zero. Thus, such a matrix can be represented using a vector containing only its diagonal components.

4.2.3.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.3.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.2.3.3 Description Extraction
Not applicable.


4.2.3.4 Description Example
Consider the 4x4 diagonal matrix with diagonal elements 1.0, 2.0, 1.0 and 1.2. Because this matrix is diagonal, it can be represented using the diagonal matrix descriptor as shown below:

<MyDiagonalMatrix Size="4"> 1.0 2.0 1.0 1.2 </MyDiagonalMatrix>

with the element declaration of

<element name="MyDiagonalMatrix" type="FloatDiagonalMatrix"/>

4.2.3.5 Description Use
Diagonal matrices often occur when representing the covariance of a distribution. They are used in the Model DS to represent the covariance matrix for a Gaussian probability distribution function.
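For illustration (the element name "MyCovariance" is hypothetical), the diagonal covariance matrix of a three-dimensional Gaussian with uncorrelated components of variances 0.5, 0.5 and 0.25 could be instantiated as:

<element name="MyCovariance" type="FloatDiagonalMatrix"/>

<MyCovariance Size="3"> 0.5 0.5 0.25 </MyCovariance>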

4.3 Probability Datatypes

The following types have been added at the request of the Audio group, as they are used in the Spoken Content DS. They are also likely to be useful in many other places.

These types need to be merged with the types in the Probability Model DS.

4.3.1 Probability Datatype
The probability datatype represents a probability value in the range zero to one.

4.3.1.1 Descriptor Syntax

<simpleType name="probabilityType" base="decimal" dervivedBy=”restriction”><minInclusive value="0.0"/><maxInclusive value="1.0"/>

</simpleType>

4.3.1.2 Descriptor Semantics

Name: probabilityValue
Definition: A datatype representing a probability between zero and one.

4.3.1.3 Description Extraction
Not applicable.

4.3.1.4 Description Example
Not applicable.

4.3.2 Confidence Datatype
The Confidence datatype represents a confidence or reliability value in the range zero to one. Higher values represent greater confidence.

4.3.2.1 Descriptor Syntax

<simpleType name="confidenceType" base="decimal" derivedBy=”restriction”>

Page 17: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<minInclusive value="0.0"/><maxInclusive value="1.0"/>

</simpleType>

4.3.2.2 Descriptor Semantics

Name: confidenceValue
Definition: A datatype representing a confidence or reliability value between zero and one; higher values represent greater confidence.

4.3.2.3 Description Extraction
Not applicable.

4.3.2.4 Description Example
Not applicable.

4.3.3 Probability Vector Datatype
A Probability Vector represents a vector that defines a probability distribution.

4.3.3.1 Descriptor Syntax

<complexType name="ProbabilityVectorType" base="DoubleVector" derivedBy="restriction"><minInclusive value=”0.0”><maxInclusive value=”1.0”>

</complexType>

4.3.3.2 Descriptor Semantics

Name: ProbabilityVectorType
Definition: A vector of probability values. The values must sum to one.

4.3.3.3 Description Extraction
Not applicable.

4.3.3.4 Description Example
See examples for Vector.

4.3.4 Probability Matrix Datatype
This datatype represents a matrix whose elements are probability values. It can be used to represent a conditional probability distribution, such as a transition matrix.

4.3.4.1 Descriptor Syntax

<complexType name="ProbabilityMatrixType" base="DoubleMatrix" derivedBy="restriction"><minInclusive value=”0.0”><maxInclusive value=”1.0”>

</complexType>

4.3.4.2 Descriptor Semantics

Name: ProbabilityMatrixType
Definition: A matrix of probability values. The values in each column must sum to one.

4.3.4.3 Description Extraction
Not applicable.

4.3.4.4 Description Example
See the example for the Matrix Datatype.

4.4 Histograms

4.4.1 Histogram D

4.4.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.4.1.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.4.1.3 Description Extraction
There are no non-normative parts associated with this descriptor. Extraction, matching and other non-normative elements are specific to the feature the descriptor characterizes, for example a color histogram.

4.4.1.4 Description Example
The following shows an example of using the Histogram descriptor:

<Histogram HistogramNormFactor="1000"> <HistogramValues>400 100 200 150 150 </HistogramValues></Histogram>

4.4.1.5 Description Use
This descriptor defines the histogram of a feature of a specific visual item, which can be representative of a region, a frame or a group of frames.

4.5 Quantizers

Currently, this descriptor is not being used anywhere in the MPEG-7 XM or WD. The Video group color quantizer descriptor does not use this form. Therefore, no examples are given.

Recommend that it be removed.

4.5.1 Uniform Quantization D
This descriptor specifies the uniform quantization of a feature space.

4.5.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.5.1.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

4.5.1.3 Description Extraction
There are no non-normative parts associated with this descriptor. Extraction, matching and other non-normative elements are specific to the feature the descriptor characterizes.

4.5.1.4 Description Example

4.5.1.5 Description Use


5 Link to the media and Localization

5.1 References to Ds and DSs

5.1.1 Reference D
This descriptor is a general tool for referencing part of the description (either a D or a DS instantiation).

5.1.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.1.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.1.3 Description Extraction
Not applicable.

5.1.1.4 Description Example
The Reference D is used in several DSs that rely on relations with elements of the description. Typical examples include relation graphs and summarization. See sections 10.3 and 12.1 for description examples.

5.1.1.5 Description Use
See sections 10.3 and 12.1.

5.1.2 ReferenceToSegment D
This descriptor is a general tool for referencing the description of a segment.

5.1.2.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.2.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.2.3 Description Extraction
Not applicable.

5.1.2.4 Description Example
The ReferenceToSegment D is used in several DSs that rely on relations with segments. Typical examples include relation graphs and summarization. See sections 10.3 and 12.1 for description examples.

5.1.2.5 Description Use
See sections 10.3 and 12.1.

5.1.3 ReferenceToProgram D
This descriptor is a general tool for referencing the description of a program.

5.1.3.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.3.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.1.3.3 Description Extraction
Not applicable.

5.1.3.4 Description Example
The ReferenceToProgram D is used in several DSs that rely on relations with programs. A typical example is summarization. See section 12.1 for description examples.

5.1.3.5 Description Use
See sections 10.3 and 12.1.

5.2 Unique Identifier

5.2.1 UIdentifier D
This descriptor allows the AV content under description to be identified. The identifier may be used to identify the content as a unique work being described (e.g., ISAN) or to identify its instances (e.g., SMPTE copy number).

5.2.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.2.1.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.2.1.3 Description Extraction
Manual instantiation.

5.2.1.4 Description Example
For instance, a book can be identified by using the International Standard Book Number (ISBN):

<UIdentifier IdOrganization="ISO" IdName="ISBN"> 0-7803-5610-1 </UIdentifier>

Similarly, an International Standard Work Code (ISWC) can be specified:

<UIdentifier IdOrganization="ISO" IdName="ISWC"> T-034.524.680-1 </UIdentifier>

5.2.1.5 Description Use
The Unique Identifier D can be used when an identification of AV content is required, either of the AV content itself or of an instance of it. It can also be used as a unique reference to external entities, for example, to identify the rights of the AV content via a rights identifier belonging to an external IPMP system.

5.3 Time elements
The temporal descriptors are based on a restricted set of lexical expressions of the ISO 8601 standard. Instead of specifying fractions of a second by an arbitrary number of decimals, these are specified by counting prespecified fractions of a second. This approach is widely used within AV media, for instance in MPEG-4 video. The lexical time expressions can represent both time as used in media streams and real-world time, e.g. production time.

The time specification is done either by using a Gregorian date and day time or by counting time units relative to a time base. The time units are specified using an arbitrary number of full days plus a part of a day defined by hours, minutes, etc. The MediaTime DS, as described in section 5.3.12, specifies a structure using these descriptors to specify a compact media time segment. The Time DS specification differs from the MediaTime DS specification by an additional description of the time zone (TZD) according to UTC.

As described before, there are two ways to specify time: a description of time using the common definition of date and time, or a specification of time using arbitrarily defined time units. The common description of time can be used to describe an absolute (start) time point (Media/TimePoint D), a time point relative to another time point (Media/RelTimePoint D), and a duration of e.g. a segment (Duration D). The specification using time units can be applied to describe a duration (Media/IncrDuration D) or a time point relative to another time point (Media/RelIncrTimePoint D).

5.3.1 TimePoint D
This Time Descriptor specifies a time point according to Gregorian date and day time. The format is based on the ISO 8601 norm. To reduce conversion problems, only a subset of the ISO 8601 formats is used.

5.3.1.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.1.2 TimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.1.3 TimePoint D Extraction
Not applicable

5.3.1.4 TimePoint D Examples
See examples for Time DS.

Editor's Note: To support current practice in AV material, the time expressions within the MDS specify a fraction of a second by counting predefined fractions, instead of the decimal expression used within the date and time datatypes of XML Schema. Nevertheless, XML Schema based time specifications can be represented as follows: for instance, a timeInstant of 13:20:01.235 would be expressed using the TimePoint D as 13:20:01:235N1000. According to the number of decimal digits used, the number of fractions of one second is 1000, as specified in the TimePoint. If this precision is used throughout a document, it only has to be specified once at the beginning.
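As an illustrative sketch following the notation of the note above (the date is hypothetical), such a time point could be instantiated as:

<TimePoint>2000-07-13T13:20:01:235N1000</TimePoint>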

5.3.1.5 TimePoint D Use
The time descriptor can be used whenever there is a need to describe a time point using Gregorian date and day time. For example the time descriptor is used in the Time DS to specify a time point or the start time of a segment.

5.3.2 Duration D
This Descriptor specifies the duration of a time period according to days and day time. The format is based on the ISO 8601 norm. To reduce conversion problems, only a subset of the ISO 8601 formats is used.

5.3.2.1 Duration D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.2.2 Duration D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.2.3 Duration D Extraction
Not applicable


5.3.2.4 Duration D Examples
See examples for Time DS.

5.3.2.5 Duration D Use
The Duration D descriptor can be used whenever there is a need to describe the duration of a time period. For example the duration descriptor is used in the Time DS to specify the duration of a time segment.

5.3.3 IncrDuration D
To enable a simplified and efficient description of segment duration using a periodical time specification (e.g. periodic samples along the timeline), the IncrDuration D specifies the duration of such a segment by counting time units. Such a time unit can, e.g., be the time increment between successive frames with respect to the world time when the sequence was recorded. The duration is then specified by the number of these time units.

5.3.3.1 IncrDuration D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.3.2 IncrDuration D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.3.3 IncrDuration D Extraction
Not applicable

5.3.3.4 IncrDuration D Examples
See examples for Time DS.

5.3.3.5 IncrDuration D Use
The IncrDuration D descriptor can be used whenever there is a need to describe the duration of a time period by counting time units. For example the descriptor is used in the Time DS to specify the duration of a time segment.

5.3.4 RelTimePoint D
This Time Descriptor specifies a time point relative to a time base using a number of days and time. The format is based on the ISO 8601 norm. The specification is similar to the one used for the Duration D.

5.3.4.1 RelTimePoint D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.4.2 RelTimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

Editor's Note: The RelTimePoint D, as well as the Duration D, contains a specification of a difference in the TZD value to describe a different time zone with respect to the time base or start time. While this value is needed in the case of the RelTimePoint D, this might not be the case for the Duration D.

5.3.4.3 RelTimePoint D Extraction
Not applicable

5.3.4.4 RelTimePoint D Examples
See examples for Time DS.

5.3.4.5 RelTimePoint D Use
The descriptor can be used whenever there is a need to describe a time point relative to a time base. For example the time descriptor is used in the Time DS to specify a time point or the start time of a segment relative to the start time of a recording.


5.3.5 RelIncrTimePoint D
This Time Descriptor specifies a time point relative to a time base by counting time units, as already specified for the IncrDuration D. If, for instance, addressing a frame by counting frames is needed, the RelIncrTimePoint D can be used, referencing the starting time stamp of the shot or of the whole video as a time base.

5.3.5.1 RelIncrTimePoint D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.5.2 RelIncrTimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.5.3 RelIncrTimePoint D Extraction
Not applicable

5.3.5.4 RelIncrTimePoint D Examples
See examples for Time DS.

5.3.5.5 RelIncrTimePoint D Use
The descriptor can be used whenever there is a need to describe a time point relative to a time base by counting time units. For example the descriptor is used in the Time DS to specify a time point or the start time of a segment relative to the start time of a recording by counting samples or frames.

5.3.6 Time DS
For the specification of time segments, the Time DS is composed of two elements: the (start) time point and the duration. If only a time point has to be specified, the duration can be omitted.

5.3.6.1 Time DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.6.2 Time DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.6.3 Time DS Extraction
Not applicable

5.3.6.4 Time DS Examples
Description of an event that started on the 3rd October 1989 at 14:13 in Germany and that has a duration of 10 days:

<Time>
  <TimePoint>1989-10-03T14:13+01:00</TimePoint>
  <Duration>P10D</Duration>
</Time>

If one picture is taken of this event every day, the time description of the 3rd picture can be specified as follows:

<Time>
  <RelIncrTimePoint timeunit="P1D"> 3 </RelIncrTimePoint>
</Time>

This example counts in time units of one day and refers implicitly to the start time of the event as a TimeBase (e.g. the start time of the root node). The period of five days after the initial event can thus be specified by:


<Time>
  <RelIncrTimePoint timeunit="P1D">0</RelIncrTimePoint>
  <RelIncrDuration timeunit="P1D">5</RelIncrDuration>
</Time>

An event occurring in England, two hours and 20 minutes after the initial one, can be specified by:

<Time>
  <RelTimePoint>PT2H20M-01:00Z</RelTimePoint>
</Time>

This specification is similar to the example using RelIncrTimePoint, but here the time offset is specified using a number of hours and minutes instead of counting time units. Additionally, it is specified that the time zone is different: the local time zone is not +01:00, as it was for the initial event, but +00:00 UTC.

Further examples of time expressions can be found in the section on MediaTime, which is inherited from Time.

5.3.6.5 Time DS Use
The description scheme can be used whenever there is a need to describe a time segment, whether it is a time point or a whole period.

5.3.7 MediaTimePoint D
This Media Time Descriptor specifies a time point according to Gregorian date and day time as used within media time stamps. The format is based on the ISO 8601 norm. To reduce conversion problems, only a subset of the ISO 8601 formats is used.

5.3.7.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.7.2 MediaTimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.7.3 MediaTimePoint D Extraction
Not applicable

5.3.7.4 MediaTimePoint D Examples
See examples for MediaTime DS and Time DS.

5.3.7.5 MediaTimePoint D Use
The media time descriptor can be used whenever there is a need to describe a media time point using Gregorian date and day time. For example the time descriptor is used in the MediaTime DS to specify a time point or the start time of a media segment.

5.3.8 MediaDuration D
This Descriptor specifies the duration of a media time period according to days and day time of media time stamps. The format is based on the ISO 8601 norm. To reduce conversion problems, only a subset of the ISO 8601 formats is used.

5.3.8.1 MediaDuration D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.8.2 MediaDuration D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.8.3 MediaDuration D Extraction
Not applicable

5.3.8.4 MediaDuration D Examples
See examples for MediaTime DS and Time DS.

5.3.8.5 MediaDuration D Use
The MediaDuration D descriptor can be used whenever there is a need to describe the duration of a media time period. For example the media duration descriptor is used in the MediaTime DS to specify the duration of a media time segment.

5.3.9 MediaIncrDuration D
To enable a simplified and efficient description of a media segment duration using a periodical time specification (e.g. periodic samples along the timeline), the MediaIncrDuration D specifies the duration of such a segment by counting time units. Such a time unit can, e.g., be the time increment of the timestamps of successive frames in a video stream. The duration is then specified by the number of these time units.

5.3.9.1 MediaIncrDuration D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.9.2 MediaIncrDuration D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.9.3 MediaIncrDuration D Extraction
Not applicable

5.3.9.4 MediaIncrDuration D Examples
See examples for MediaTime DS and Time DS.

5.3.9.5 MediaIncrDuration D Use
The MediaIncrDuration D descriptor can be used whenever there is a need to describe the duration of a media time period by counting time units. For example the descriptor is used in the MediaTime DS to specify the duration of a media time segment.

5.3.10 MediaRelTimePoint D
This Media Time Descriptor specifies a time point relative to a time base using a number of days and time. The format is based on the ISO 8601 norm. The specification is similar to the one used for the Duration D.

5.3.10.1 MediaRelTimePoint D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.10.2 MediaRelTimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.10.3 MediaRelTimePoint D Extraction
Not applicable

5.3.10.4 MediaRelTimePoint D Examples
See examples for MediaTime DS and Time DS.


5.3.10.5 MediaRelTimePoint D Use
The descriptor can be used whenever there is a need to describe a media time point relative to a time base. For example the time descriptor is used in the MediaTime DS to specify a time point or the start time of a media segment relative to the start time of a recording.

5.3.11 MediaRelIncrTimePoint D
This Media Time Descriptor specifies a time point relative to a time base by counting time units, as already specified for the MediaIncrDuration D. If, for instance, addressing a frame by counting frames is needed, the MediaRelIncrTimePoint D can be used, referencing the starting time stamp of the shot or of the whole video as a time base.

5.3.11.1 MediaRelIncrTimePoint D Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.11.2 MediaRelIncrTimePoint D Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.11.3 MediaRelIncrTimePoint D Extraction
Not applicable

5.3.11.4 MediaRelIncrTimePoint D Examples
See examples for MediaTime DS and Time DS.

5.3.11.5 MediaRelIncrTimePoint D Use
The descriptor can be used whenever there is a need to describe a media time point relative to a time base by counting time units. For example the descriptor is used in the MediaTime DS to specify a media time point or the start time of a media segment relative to the start time of a recording by counting samples or frames.

5.3.12 MediaTime DS
For the specification of time segments according to the time stamps of the media, the MediaTime DS is composed of two elements: the (start) time point and the duration. If only a time point has to be specified, the duration can be omitted.

5.3.12.1 MediaTime DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.12.2 MediaTime DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.3.12.3 MediaTime DS Extraction
Not applicable

5.3.12.4 MediaTime DS Examples
Suppose we have a video consisting of the following segments:
- Segment1: 0-0.1 (sec)
- Segment2: 0.1-1 (sec)
- Segment3: 1-10 (sec)
- Segment4: 10-20 (sec)
- Segment5: 20-1000 (sec)

The video and these segments are described using the MediaTime DS in different ways in this section. Note that this is done to illustrate the possibilities of the MediaTime DS; in a real description, usually only one of these possibilities is used throughout the document.


<!-- Specification of the video location -->
<MediaLocator>
  <MediaURL> http://www.mpeg7.org/the_video.mpg </MediaURL>
</MediaLocator>

<!-- MediaTime for Segment1 (0-0.1 sec) -->
<MediaTime>
  <MediaRelTimePoint TimeBase="MediaLocator(1)"> P0S </MediaRelTimePoint>
  <MediaDuration>PT1N10F</MediaDuration>
</MediaTime>

This segment is described by using the start time of the video as time base and specifying the start time and the duration of the segment.

<!-- MediaTime for Segment2 (0.1-1 sec) -->
<MediaTime>
  <MediaRelTimePoint> PT1N </MediaRelTimePoint>
  <MediaDuration> PT9N </MediaDuration>
</MediaTime>

Segment 2 is also specified by the start time of the segment and the duration. But this time, the setting for TimeBase and the number of fractions of one second are inherited from the first instantiation of the MediaTime DS.

<!-- MediaTime for Segment3 (1-10 sec) -->
<MediaTime>
  <MediaRelIncrTimePoint timeunit="PT1N30F">30</MediaRelIncrTimePoint>
  <MediaIncrDuration>270</MediaIncrDuration>
</MediaTime>

For segment 3, counting time units is used for the specification of the start time and the duration. The timeunit is specified as 1N, which is 1/30 of a second according to 30F. But if needed, a timeunit for the exact sample rate of 29.97 Hz is also possible, with 30000F and 1001N.
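As an illustrative sketch of that NTSC-rate timeunit (the count of 270 is arbitrary here; at this rate, 270 frames correspond to about 9.009 seconds):

<MediaIncrDuration timeunit="PT1001N30000F">270</MediaIncrDuration>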

<!-- MediaTime for Segment4 (10-20 sec) -->
<MediaTime>
  <MediaRelIncrTimePoint>300</MediaRelIncrTimePoint>
  <MediaIncrDuration>300</MediaIncrDuration>
</MediaTime>

This segment is described similarly to the previous one, but this time the timeunit is inherited from the first MediaTime element in this document which uses incremental time specification (i.e. the previous MediaTime element).

<!-- MediaTime for Segment5 (20-1000 sec) -->
<MediaTime>
  <MediaRelTimePoint>PT20S</MediaRelTimePoint>
  <MediaDuration>PT16M20S</MediaDuration>
</MediaTime>

The last segment again specifies the time using seconds and minutes. Specifying the duration as PT980S is also allowed. But if you want to count time units (in this case seconds), you should use the MediaIncrDuration:

<MediaIncrDuration timeunit="PT1S">980</MediaIncrDuration>

For another example, consider a description of a video segment with its root segment specified with:

<MediaTime>
  <MediaTimePoint>1989-10-03T14:13:02:0F30</MediaTimePoint>
  <MediaIncrDuration timeunit="PT1N30F">1340</MediaIncrDuration>
</MediaTime>

Since the video itself is assumed to have a framerate of 30 frames per second, the number of fractions of one second is set to 30F and the timeunit is set to 1N. Thus, by specifying a segment of 1340 frames, the duration of the segment is also described.

To specify a single frame in a video sequence, the frame numbers or the timestamps can be used in the following way:

<MediaTime>
  <MediaRelIncrTimePoint>561</MediaRelIncrTimePoint>
</MediaTime>

In this case, the 561st frame is referenced, using the very beginning of the described segment as a start point for counting, i.e. the start point of the root segment is the (implicit) TimeBase. The video itself is displayed at a framerate of 30 frames per second. In the example each frame is counted (1N); since 561 = 18 x 30 + 21, the corresponding timestamp is 18 seconds and 21 frames (18s:21).

If you do not want to relate the elapsed time to the counted frames, you can use timestamps directly:

<MediaTime>
  <MediaRelTimePoint>PT18S21N</MediaRelTimePoint>
</MediaTime>

Now suppose the whole video is divided into subsegments (e.g. a shot from 500 to 600):

<MediaTime>
  <MediaRelIncrTimePoint> 500 </MediaRelIncrTimePoint>
  <MediaIncrDuration> 100 </MediaIncrDuration>
</MediaTime>

and you want to address within this shot the above mentioned frame:

<MediaTime>
  <MediaRelIncrTimePoint TimeBase="../../../MediaTime">61</MediaRelIncrTimePoint>
</MediaTime>

In this case the start time of the shot (i.e. the parent node of the node which contains the present MediaTime DS) is referenced as TimeBase.

5.3.12.5 MediaTime DS Use
The description scheme can be used whenever there is a need to describe a time segment, whether it is a time point or a whole period, using the time information of the AV content. For example the MediaTime DS is used within the MediaLocator DS.

5.4 Media locators
Media locators are used to specify the "location" of AV content.

5.4.1 MediaURL D
The MediaURL uses an URI to locate the AV content.

5.4.1.1 MediaURL D Syntax

<!-- ################################################ -->
<!-- Definition of the MediaURL D -->
<!-- ################################################ -->
<simpleType name="MediaURL" base="uri"/>


5.4.1.2 MediaURL D Semantic

Name: MediaURL
Definition: Descriptor specifying the location of AV content using an URI.

5.4.1.3 MediaURL D Extraction
Not applicable

5.4.1.4 MediaURL D Examples

<MediaURL> http://www.mpeg7.org/demo.mpg </MediaURL>

In this example the location of an mpg file is specified by using an URI.

5.4.1.5 MediaURL D Use
The descriptor can be used whenever there is a need to specify the location of AV content by using an URI.

5.4.2 MediaLocator DS
The MediaLocator DS is used to specify the "location" of a particular image, audio or video segment by referencing the media data. There are four types of MediaLocators: the VideoSegmentLocator, the AudioSegmentLocator, the ImageLocator, and the SoundLocator.

5.4.2.1 MediaLocator DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.2.2 MediaLocator DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.2.3 MediaLocator DS Extraction
Not applicable

5.4.2.4 MediaLocator DS Examples

<MediaLocator>
  <MediaURL> http://www.mpeg7.org/demo.mpg </MediaURL>
  <MediaTime>
    <RelTime>PT3S</RelTime>
    <Duration>PT10S</Duration>
  </MediaTime>
</MediaLocator>

In this example the location of a video segment is specified by the URI of a video file, the relative start time with respect to the beginning of the file, and the duration of the segment.

5.4.2.5 MediaLocator DS Use
This description scheme can be used whenever there is a need to specify the location of AV content with one of the contained mechanisms.

5.4.3 VideoSegmentLocator DS

5.4.3.1 VideoSegmentLocator DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.3.2 VideoSegmentLocator DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.3.3 VideoSegmentLocator DS Extraction
Not applicable

5.4.3.4 VideoSegmentLocator DS Examples
See example for MediaLocator DS.

5.4.4 ImageLocator DS

5.4.4.1 ImageLocator DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.4.2 ImageLocator DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.4.3 ImageLocator DS Extraction
Not applicable

5.4.4.4 ImageLocator DS Examples
See example for MediaLocator DS.

5.4.5 AudioSegmentLocator DS

5.4.5.1 AudioSegmentLocator DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.5.2 AudioSegmentLocator DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.5.3 AudioSegmentLocator DS Extraction
Not applicable

5.4.5.4 AudioSegmentLocator DS Examples
See example for MediaLocator DS.

5.4.6 SoundLocator DS

5.4.6.1 SoundLocator DS Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.6.2 SoundLocator DS Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

5.4.6.3 SoundLocator DS Extraction
Not applicable

5.4.6.4 SoundLocator DS Examples
See example for MediaLocator DS.


6 Basic elements

6.1 Textual description

6.1.1 Language attribute

6.1.1.1 Attribute Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.1.2 Attribute Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.1.3 Description Extraction
The attribute may be instantiated manually, or can be determined automatically from the enclosed textual description. In most cases, the attribute will be specified manually.

6.1.1.4 Description Example

Example of instantiation:

<TextAnnotation xml:lang="en-us"> This is a nice apartment. </TextAnnotation>

<TextAnnotation xml:lang="en-uk"> This is a nice flat. </TextAnnotation>

6.1.1.5 Description Use
There are two situations in which specification of language is required in MPEG-7: (1) to describe the language of the description information itself, or (2) to describe the language of the content (essence). In the first case the language is a property of the MPEG-7 description (i.e. the metadata), and in the second case the language information is itself metadata. Depending on which case applies, MPEG-7 provides two solutions. In case (1), a special attribute (xml:lang) is used to specify the language used to write the description. In case (2), the Language descriptor is used. Note that the content itself may be textual.

The XML Language attribute must be used for all Descriptors where the language needs to be identified (e.g., Annotation or Title). In the case of controlled vocabularies and thesauri (if they are closed), the list can be mapped to any language, although a default one may be used for the specification. This attribute is used to specify the language in which a textual description is expressed. It is what enables MPEG-7 to support multilingual content description.

6.1.2 Language Datatype

6.1.2.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.2.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


6.1.2.3 Description Extraction
In many cases this information will be created manually. However, it is possible to extract it automatically using language identification for either text or audio.

6.1.2.4 Description Example
To express the preferred language for content, an element can be declared as follows.

<element name="PreferredLanguage" base="language"/>

The following example states that the preferred language is English.

<PreferredLanguage>en</PreferredLanguage>

6.1.2.5 Description Use
The language datatype is used for identifying the language of the AV content, either the audio or the subtitles. It may also be used for selection of a preferred language, or to attempt automatic translation to a language selected by the user.

6.1.3 ControlledTerm D

This descriptor is currently in a Core Experiment and is likely to change from the syntax specified. Until the Core Experiment completes, this section is being left as is.

6.1.3.1 Descriptor Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.3.2 Descriptor Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.3.3 Description Extraction
Manual instantiation.

6.1.3.4 Description Example

<!-- Use in the specification -->
<element name="Genre" type="ControlledTerm"/>

<!-- descriptions -->
<Genre CSName="AllMusicGuideGenres" CSLocation="http://www.xxx/genres/" CSTermId="gen:34:2">Hard Rock</Genre>

<Genre>My Favourite Tunes</Genre>

6.1.3.5 Description Use
The Controlled Term descriptor is used for textual annotation. It can be used in two modes. The basic mode (a string without any associated classification scheme) allows free text annotation. In the extended mode, the value of the descriptor is obtained from a classification scheme (thesaurus, ontology, etc.). In this case, the information contained in the classification scheme can be used by applications in order to provide additional semantic and translation features. The CSLocation and CSTermId values are to be interpreted by the application. The classification scheme is supposed to be defined outside the MPEG-7 standard. Nevertheless, in the future, it may be possible to have classification schemes defined as DDL compliant files, allowing access to MPEG-7 or application defined classification schemes.

A possible scenario may be the following:


Classification Scheme File

<!-- CS for PlaceRole in Place DS -->
<!-- file location http://www.mpeg.org/mpeg7/cs/placerole.ddl -->
<cs name="mpeg7:PlaceRole">
  <enumeration>
    <literal>Shooting location</literal>
    <literal>Represented location</literal>
    <literal>Postal Address</literal>
    <literal>General Locator</literal>
  </enumeration>
</cs>

Description

<PlaceRole CSName="mpeg7:PlaceRole" CSLocation="http://www.mpeg.org/mpeg7/cs/placerole.ddl">Shooting location</PlaceRole>

6.1.4 TextualDescription Datatype

6.1.4.1 Datatype Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.4.2 Datatype Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.4.3 Datatype Extraction
Not applicable

6.1.4.4 Datatype Examples
See example in Annotation DS.

6.1.4.5 Datatype Use
This datatype is used as a base type for textual descriptions where the language needs to be identified.

6.1.5 StructuredAnnotation DS

6.1.5.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.5.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.1.5.3 Description Extraction
Manual instantiation.

6.1.5.4 Description Example

<StructuredAnnotation>
  <Who>Fernando Morientes</Who>
  <WhatAction CSName="Sports" CSLocation="www.eurosport.xxx/cs/soccer/">scoring goal</WhatAction>
  <!-- the thesaurus is a fictional example -->
  <When>Spain Sweden soccer match</When>
  <!-- This description is written in English -->
  <TextAnnotation xml:lang="en-us">This was the first goal of this match.</TextAnnotation>
</StructuredAnnotation>

6.1.5.5 Description Use

The ambiguity of natural language representation is one of the drawbacks of free text annotation. If the annotations are provided in a structured format, in conjunction with an associated thesaurus or controlled vocabulary list, they are of great help for users retrieving material. This allows a simple but powerful (non-ambiguous) annotation and search. Nevertheless, free text can also be used in any of the 6W fields. In this case, the structure itself adds information to the pure free text annotation.

The StructuredAnnotation DS can be used for simple semantic annotation. Using the 6W descriptors a structured textual description is possible, including the possibility of using associated classification schemes. Also a free text annotation is provided for short descriptions with a free style.

6.2 Description of persons

This section contains the description schemes and datatypes concerned with the description of people.

6.2.1 Person DS
The Person DS contains the description tools (Ds and DSs) intended for the description of persons (e.g., an actor, a director, a character, a dubbing actor), organizations (e.g., a company) and groups of people (e.g., a rock band).

6.2.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.1.3 Description Extraction
Manual instantiation.

6.2.1.4 Description Example

<!-- specification -->
<element name="Creator" type="Person"/>
<element name="Publisher" type="Person"/>
...

<!-- description -->
<Creator xsi:type="PersonGroup">
  <Name>Rolling Stones</Name>
  <Member xsi:type="Individual">
    <Name>
      <GivenName initial="M">Mick</GivenName>
      <FamilyName>Jagger</FamilyName>
    </Name>
  </Member>
  <Member xsi:type="Individual">
    <Name>
      <GivenName>Keith</GivenName>
      <FamilyName>Richard</FamilyName>
    </Name>
  </Member>
  <Member xsi:type="Individual">
    <Name>
      <GivenName>Ron</GivenName>
      <FamilyName>Wood</FamilyName>
    </Name>
  </Member>
  <Member xsi:type="Individual">
    <Name>
      <GivenName>Charly</GivenName>
      <FamilyName>Watts</FamilyName>
    </Name>
  </Member>
</Creator>

<Publisher xsi:type="Organization">
  <Name>ACME Publishing Co.</Name>
  <ContactPerson> .... </ContactPerson>
</Publisher>

6.2.1.5 Description Use
The Person DS, and the description schemes that derive from it, can be used to describe real, historical, or fictional persons. This description scheme can be used for describing both persons depicted inside the content itself (historical, fictional, or real) and persons in the real world (the actors, users, and others who relate in some fashion to the content). What distinguishes these two uses of the Person DS is the context: the description scheme containing the Person DS. The Person DS is used as the generic way to describe persons, including individuals, organizations, and groups.

6.2.2 Person Name Datatype

6.2.2.1 Datatype Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.2.2 Datatype Semantic
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.2.3 Datatype Extraction
Not applicable

6.2.2.4 Datatype Examples
The following shows several different examples of using this datatype to describe a person's name. It also illustrates, through examples, some of the different description variations supported by this description scheme.

The following examples show different variations for describing the name of the fictional figure "Edwina J. Hoover".

<!-- The name is represented in the simplest fashion, with -->
<!-- no decomposition into various components -->
<Name>
  <GivenName>Edwina J. Hoover</GivenName>
</Name>

<!-- The name is represented in a slightly more complex -->
<!-- fashion with more information. -->
<Name>
  <GivenName>Edwina</GivenName>
  <GivenName initial="J"></GivenName>
  <FamilyName>Hoover</FamilyName>
</Name>

<!-- The name is represented in a great deal of detail -->
<!-- with decomposition into name components -->
<Name>
  <GivenName>Edwina</GivenName>
  <GivenName initial="J">Jermina</GivenName>
  <FamilyName>Hoover</FamilyName>
  <FamilyName>van Blumen</FamilyName>
  <FamilyName initial="Jr.">Junior</FamilyName>
  <Title>Professor</Title>
  <Title>F.R.S.</Title>
</Name>

Another example shows a name for which only a given name exists:

<Name>
  <GivenName>Dinesh</GivenName>
</Name>

In the following Spanish name, the given name for the person can not be broken down further. Notice also the use of the language attribute xml:lang to specify the language of the name.

<Name xml:lang="es"><GivenName initial="J.M." abbrev="Chema">José María</GivenName> <FamilyName>Martínez</FamilyName><FamilyName>Sánchez</FamilyName>

</Name>

6.2.2.5 Datatype Use
This datatype is used whenever a person's name needs to be represented in a structured form.

6.2.3 Individual DS
The Individual DS contains the description tools (Ds and DSs) intended to describe individual persons (e.g., actor, director, character, dubbing actor, etc.).

6.2.3.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.3.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.3.3 Description Extraction
Manual instantiation.

6.2.3.4 Description Example
See the example under the Person DS section.

6.2.3.5 Description Use
The Individual DS can be used to describe individual persons by their names.

6.2.4 PersonGroup DS

6.2.4.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.4.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.4.3 Description Extraction
Manual instantiation.

6.2.4.4 Description Example
See the example under the Person DS section.

6.2.4.5 Description Use

6.2.5 Organization DS
The Organization DS contains the description tools (Ds and DSs) intended for the description of organizations (e.g., a company).

6.2.5.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.5.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.2.5.3 Description Extraction
Manual instantiation.

6.2.5.4 Description Example

<Organization id="ORG1">
  <OrganizationName> MyCompany </OrganizationName>
  <ContactPerson>
    <Name>
      <GivenName> MyGivenName </GivenName>
      <FamilyName> MyFamilyname </FamilyName>
    </Name>
  </ContactPerson>
</Organization>

6.2.5.5 Description Use
The Organization DS can be used to describe organizations, companies, etc.

6.3 Description of places

6.3.1 Place DS
The Place DS contains the description tools (Ds and DSs) intended for the description of locations.

6.3.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.3.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.3.1.3 Description Extraction
Manual instantiation.


6.3.1.4 Description Example

<!-- specification -->
<element name="Location" type="Place"/>

<!-- description -->
<Location>
  <PlaceName xml:lang="en">Rome</PlaceName>
  <PlaceName xml:lang="it">Roma</PlaceName>
  <PlaceRole>shooting location</PlaceRole>
  <Country>it</Country>
</Location>

<!-- part of Organization DS -->
<Address>
  <PlaceName xml:lang="en">Madrid</PlaceName>
  <Country>es</Country>
  <PlaceRole>postal address</PlaceRole>
  <Planet>Earth</Planet>
  <GPSCoordinates GPSsystem="">XXXX</GPSCoordinates>
  <Region>cam</Region> <!-- is this an existing iso3166-2 value? -->
  <PostingIdentifier>E-28040</PostingIdentifier>
  <AdminstrativeUnit>city</AdminstrativeUnit>
  <PostalAddress> E.T.S.Ing. Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n</PostalAddress>
  <InternalCoordinates>C-306</InternalCoordinates>
</Address>

6.3.1.5 Description Use
The Place DS is used as a primitive DS to describe places, either real (addresses, locations) or fictional (e.g. the setting for a movie).

6.4 Description of importance or priority
The Weight DS is a primitive DS for describing with weights the importance or priority of the elements in a content description.

6.4.1 Weight DS

6.4.1.1 Description Scheme Syntax

<!-- ################################################## -->
<!-- Definition of Weight DS and its basic building blocks -->
<!-- ################################################## -->

<complexType name="WeightValue">
  <choice minOccurs="0" maxOccurs="1">
    <element ref="Reference"/>
    <element name="KeyName" type="string"/>
  </choice>
  <attribute name="ReferenceValue" type="float" use="optional"/>
</complexType>

<complexType name="Weight">
  <element name="WeightValue" type="mds:WeightValue" maxOccurs="unbounded"/>
  <attribute name="id" type="ID"/>
  <attribute name="reliability" type="float"/>
  <attribute name="WeightName" type="string"/>
</complexType>

6.4.1.2 Description Scheme Semantics

Semantics of the WeightValue DS:

Name: WeightValue
Definition: A weight value assigned to a component identified by a reference, or a weighting whose criterion is identified by a key name. In a reference to a description element (descriptor or description scheme), the weight value represents the relative importance of that description element among multiple elements within the same context.

Name: ReferenceValue
Definition: The weight value assigned to the component, in the range 0 to 1. If this value is high (close to 1), it means that the instance is very important.

Name: Reference
Definition: The reference identifies the element of the description (descriptor or description scheme) to which the weight value is attached.

Name: KeyName
Definition: A string that directly specifies the criterion by which the reference value is assigned. For example, if the weight value is assigned based on how exciting a segment is, the key name can have the value "exciting".

Semantics of the Weight DS:

Name: Weight
Definition: A primitive DS for describing various weights, importance or priority of elements in content descriptions.

Name: id
Definition: Identifier for each instance of the Weight DS.

Name: reliability
Definition: Describes how reliable the weights are. In other words, it is a confidence measure of the weights. For example, if the reliability is low, the weight value is not trustworthy; on the other hand, if the reliability is high, the weight value represents the importance of the target description very well. The range of this value is 0.0 to 1.0.

Name: WeightName
Definition: The semantics of the reference value. This value specifies how to interpret the reference value.

Name: WeightType
Definition: The kind of weighting information.

Name: WeightValue
Definition: Weight values that are associated with this particular instance of the Weight DS.

6.4.1.3 Description Extraction

There are no extraction methods defined for the Weight DS, as it only provides a framework that can be used for a wide variety of different kinds of weighting schemes. The details of the extraction method will depend on the particular application.

6.4.1.4 Description Example

The following example shows how the Weight DS can be used to attach importance (or credibility) values to the descriptors of a region in a still image. In this example, the weightings could be used to determine the importance of the descriptors when matching.

<StillRegion id="IMG0001">
  <Weight id="Weight01" WeightName="AUTOMATIC_DescriptorWeight" reliability="0.900000">
    <WeightValue ReferenceValue="0.254">
      <Reference idref="ColorHist01"/>
    </WeightValue>
    <WeightValue ReferenceValue="0.265">
      <Reference idref="EdgeHist01"/>
    </WeightValue>
    <WeightValue ReferenceValue="0.223">
      <Reference idref="Activity01"/>
    </WeightValue>
    <WeightValue ReferenceValue="0.258">
      <Reference idref="BoundingBox01"/>
    </WeightValue>
  </Weight>
  <ColorHistogram id="ColorHist01">
    <!-- Details omitted -->
  </ColorHistogram>
  <EdgeHistogram id="EdgeHist01">
    <!-- Details omitted -->
  </EdgeHistogram>
  <MotionActivity id="Activity01">
    <!-- Details omitted -->
  </MotionActivity>
  <BoundingBox id="BoundingBox01">
    <!-- Details omitted -->
  </BoundingBox>
  <!-- Other details omitted -->
</StillRegion>

6.4.1.5 Description Use

The Weight DS can be used to represent many different kinds of weightings, from weightings of low-level descriptors to high-level semantic concepts. For example, we can filter content with the help of the weights, or we can find the data of interest using a list of descriptors or a combination of descriptors with different weights. When we have multiple descriptions of an AV content (e.g. semantic keywords, video descriptors, and audio descriptors), there can be a difference in the importance or credibility of each description. In such cases, we can browse audiovisual content based on the importance of the descriptors or the credibility of the descriptions.
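As an informative illustration, the following Python sketch shows how the weights from the example above could be combined with per-descriptor distances during matching. The distance values and the fallback rule for low reliability are assumptions of this sketch, not part of the DS.

def weighted_distance(distances, weights, reliability, reliability_floor=0.5):
    """Combine per-descriptor distances into one score using Weight DS values.

    distances   -- dict mapping descriptor id to a distance in [0, 1]
    weights     -- dict mapping descriptor id to its ReferenceValue in [0, 1]
    reliability -- confidence in the weights; below the floor this sketch
                   falls back to an unweighted average (an assumption)
    """
    if reliability < reliability_floor:
        return sum(distances.values()) / len(distances)
    total_weight = sum(weights[k] for k in distances)
    return sum(weights[k] * distances[k] for k in distances) / total_weight

# ReferenceValues taken from the StillRegion example above:
weights = {"ColorHist01": 0.254, "EdgeHist01": 0.265,
           "Activity01": 0.223, "BoundingBox01": 0.258}
# Hypothetical per-descriptor distances computed by the matching functions:
distances = {"ColorHist01": 0.12, "EdgeHist01": 0.40,
             "Activity01": 0.05, "BoundingBox01": 0.30}
print(weighted_distance(distances, weights, reliability=0.9))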

6.5 Entity-relationship graph

An EntityRelationshipGraph DS is a primitive DS for graph representation of entities (e.g. segments) and relationships among the entities. Although hierarchical structures such as trees are adequate for efficient access and retrieval, some relationships cannot be expressed using such structures. The EntityRelationshipGraph DS is defined to add flexibility in describing more general relationships among entities.

6.5.1 EntityRelationshipGraph DS

6.5.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.5.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

6.5.1.3 Description Extraction

The EntityRelationshipGraph DS is a general tool for graph representation. It is intended to be used as a set of super-classes from which concrete graphs are derived, such as the SegmentRelationshipGraph. See section 10.3 for extraction examples.

6.5.1.4 Description Example

The EntityRelationshipGraph DS is a general tool for graph representation. It is intended to be used as a set of super-classes from which concrete graphs are derived, such as the SegmentRelationshipGraph. See section 10.3 for examples.

6.5.1.5 Description Use

The EntityRelationshipGraph DS is a general tool for graph representation. It is intended to be used as a set of super-classes from which concrete graphs are derived, such as the SegmentRelationshipGraph. See section 10.3 for description use.


6.6 Description of Time Series

6.6.1 TemporalInterpolation D

The TemporalInterpolation D describes a temporal interpolation by connected polynomials. This can be used to approximate variable values that change with time, such as an object position in a video. The description size of the temporal interpolation is usually much smaller than a description of all values. In Figure 2, 25 real values are expressed by five linear interpolation functions or by two quadratic interpolation functions.

Figure 2: Real Data and Interpolation Functions.

6.6.1.1 Description Syntax

Editor's Note: The full syntax of this descriptor can only be expressed if maxOccursPar and minOccursPar are added to the MPEG-7 DDL. However, the definition below does not use these features.

The InterpolationFunctionID type is not defined.

<!-- ########################################### -->
<!-- Definition of TemporalInterpolation D -->
<!-- ########################################### -->

<element name="TemporalInterpolation" type="mds:TemporalInterpolation"/>

<complexType name="TemporalInterpolation">
  <!-- Description of key point time instants -->
  <choice>
    <element name="WholeInterval" type="mds:Time"/>
    <element name="KeyPointT" type="mds:Time" maxOccurs="unbounded"/>
  </choice>
  <sequence maxOccurs="unbounded"> <!-- This part appears "Dimension" times -->
    <!-- Description of key point values -->
    <element name="KeyPoint" type="float" maxOccurs="unbounded"/>
    <!-- Description of interpolation functions -->
    <sequence minOccurs="0" maxOccurs="unbounded">
      <!-- This part appears 0 times or "NKeyPoints-1" times -->
      <element name="FunctionID" type="mds:InterpolationFunctionID">
        <simpleType base="integer">
          <minInclusive value="-1"/>
          <maxInclusive value="1"/>
        </simpleType>
      </element>
      <element name="FunctionParameterValue" type="float" minOccurs="0"/>
    </sequence>
  </sequence>
  <attribute name="NKeyPoints" type="positiveInteger" use="required"/>
  <attribute name="Dimension" type="positiveInteger" use="required"/>
</complexType>

6.6.1.2 Description Semantics

Semantics of the TemporalInterpolation D:

TemporalInterpolation: Description of a temporal interpolation. The temporal interpolation consists of connected polynomials.

WholeInterval: Specifies the whole temporal interval during which this descriptor is valid. If this field is defined, the time interval between each successive pair of key points is constant. The length of these intervals is derived as duration/(NKeyPoints-1). If the time intervals are not constant, all key point time instants are described explicitly instead.

KeyPointT: Specifies the time instant of each key point. This appears NKeyPoints times.

KeyPoint: Specifies the position of each key point, for the given dimension. This appears NKeyPoints times for each dimension.

FunctionID: Specifies the type of interpolation function used, as specified in Table 2. If the sequence of FunctionIDs is skipped in some dimension, the default interpolation function (linear interpolation) is used for each time interval in this dimension. Therefore, this field occurs either zero times or NKeyPoints-1 times for each dimension.

FunctionParameterValue: Specifies the coefficient of the interpolation function when relevant, i.e. the second-order coefficient $a_2$ (see Table 2) in the case FunctionID=1.

NKeyPoints: The number of sampled positions, denoted as key points, used as knots of the interpolation functions.

Dimension: The dimension of the interpolation function. Table 1 shows examples of the value of 'Dimension'.

The TemporalInterpolation D describes temporally variable values by connected polynomials. The dimension of the variable (excluding the temporal element) is specified by Dimension; some examples are shown in Table 1. The connection points are called "key points". The number of key points is described by NKeyPoints. The time instants of the key points are described by WholeInterval or by an array of KeyPointT. WholeInterval is used only for constant intervals, and the length of each interval is derived by dividing the specified duration by NKeyPoints-1. The values of the key points are described by an array of KeyPoint.

Table 1: Examples of the value of "Dimension"

D or DS using TemporalInterpolation                Value of 'Dimension'
2D MotionTrajectory                                2
3D MotionTrajectory                                3
ParameterTrajectory (Translational Model)          2
ParameterTrajectory (Affine Transformation Model)  6
ParameterTrajectory (Parabolic Model)              12

The TemporalInterpolation D can use two types of interpolation functions: 1st order and 2nd order polynomials. The type of interpolation function is indicated by FunctionID. When a 1st order polynomial is used (FunctionID=0), the function can be calculated from the key points alone. When a 2nd order polynomial is used (FunctionID=1), the 2nd order coefficient is described by FunctionParameterValue. The relations between FunctionID, the interpolation functions, the key points and FunctionParameterValue are shown in Table 2. The constraint is imposed so that each interpolation function passes through the key points at both ends of its interval.

Table 2: Interpolation Function Specified by FunctionID and FunctionParameterValue

FunctionID -1: Function Form: (none). FunctionParameterValue: (none). Constraint: (Not Applicable).

FunctionID 0: Function Form: $v(t) = a_0 + a_1 t$ (1st order). FunctionParameterValue: (none). Constraint: $v(t_s) = v_s$ and $v(t_e) = v_e$.

FunctionID 1: Function Form: $v(t) = a_0 + a_1 t + a_2 t^2$ (2nd order). FunctionParameterValue: $a_2$. Constraint: $v(t_s) = v_s$ and $v(t_e) = v_e$.

Here $(t_s, v_s)$ and $(t_e, v_e)$ denote the key points at the two ends of the interval.

One of the two types of polynomials can be selected for each interval. If only 1st order polynomials are used for all intervals, all descriptions of FunctionID can be omitted.
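Because each interpolation function is constrained to pass through the key points at both ends of its interval, the remaining coefficients are fully determined once the second-order coefficient is known. A minimal Python sketch of evaluating one interval (informative; variable names are illustrative):

def interpolate(t, t_s, v_s, t_e, v_e, a2=0.0):
    """Evaluate the interval polynomial v(t) = a0 + a1*t + a2*t^2.

    a2 = 0.0 corresponds to FunctionID=0 (1st order, derived from the key
    points alone); a nonzero a2 is the FunctionParameterValue carried with
    FunctionID=1. a1 and a0 follow from the end-point constraints
    v(t_s) = v_s and v(t_e) = v_e.
    """
    a1 = (v_e - v_s) / (t_e - t_s) - a2 * (t_s + t_e)
    a0 = v_s - a1 * t_s - a2 * t_s * t_s
    return a0 + a1 * t + a2 * t * t

# First x-axis interval of the example in 6.6.1.4: key points (0.0, 118.9)
# and (2.0, 102.1), second-order coefficient 2.1.
print(interpolate(1.0, 0.0, 118.9, 2.0, 102.1, a2=2.1))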

6.6.1.3 Description Extraction

The algorithm CALC_INTERPOLATION gives an example of a key point extraction method for variable key point intervals. It determines key points sequentially based on the approximation error of the interpolation functions.

In the algorithm, the initial interpolation is calculated between the first two points. When a new point is added, the interpolation is modified to include the new point. If the approximation error of the interpolation becomes larger than a given threshold, then a key point is inserted before the latest point. After insertion of a key point, the next interpolation is calculated between the latest two points. Repeating this procedure, a set of interpolation functions is finally generated. When the variables are well approximated by the interpolation model, this algorithm gives long intervals, and vice versa.

To describe the algorithm formally, some notation is defined here. Let $M$ be the number of variables (Dimension) and denote the value of the $m$-th variable at time $t$ by $v_m(t)$. It is assumed that $v_m(t)$ exists at a time sequence $t_0, t_1, t_2, \ldots$. In the algorithm, a function $f_m(t)$ is calculated sequentially as a candidate for an interpolation function of the $m$-th variable between $t_s$ and $t_e$. To evaluate the candidate function, an approximation error $e_m$ is calculated by

$$e_m = \frac{1}{e - s + 1} \sum_{i=s}^{e} \left( f_m(t_i) - v_m(t_i) \right)^2,$$

where $t_s$ and $t_e$ denote the start and the end time of the interval over which the interpolation is estimated, respectively. If the error is in an acceptable range, the length of the key point interval is increased by incrementing $e$. Otherwise, a key point is put before $t_e$ to guarantee the minimum approximation error. The following are the steps of the algorithm.

CALC_INTERPOLATION:

1. (Initialization) Set $s = 0$, $e = 1$.

2. (Interpolation Calculation) Calculate the interpolation functions $f_m(t)$, $m = 0, \ldots, M-1$. Least square estimation can be used under the conditions $f_m(t_s) = v_m(t_s)$ and $f_m(t_e) = v_m(t_e)$.

3. (Interpolation Evaluation) Compute the approximation errors $e_m$. If there exists an $e_m$ which is greater than the corresponding threshold, go to Step 4. Otherwise, go to Step 5.

4. (KeyPoint Insertion) Accept the functions $f_m(t)$ calculated over $[t_s, t_{e-1}]$ as interpolation functions and set $s = e - 1$.

5. (Increment and Termination) Set $e = e + 1$. If $v_m(t_e)$ is available, go to Step 2. Otherwise, terminate.
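A one-dimensional Python sketch of this procedure (informative; it assumes line-through-end-points candidate functions and a maximum-absolute-error criterion, whereas an implementation may use least square estimation and a different error measure):

def calc_interpolation(t, v, threshold):
    """Sequential key point selection for a single variable (M = 1).

    t, v      -- sample times and values, t strictly increasing
    threshold -- largest acceptable approximation error per interval
    Returns the indices of the selected key points.
    """
    key_points = [0]              # Step 1: the first sample starts an interval
    s, e = 0, 1
    while e < len(t):
        # Step 2: candidate function, here the line through both end points
        slope = (v[e] - v[s]) / (t[e] - t[s])
        # Step 3: approximation error over the candidate interval
        err = max(abs(v[s] + slope * (t[i] - t[s]) - v[i])
                  for i in range(s, e + 1))
        if err > threshold:
            # Step 4: insert a key point before the latest sample
            key_points.append(e - 1)
            s = e - 1
        e += 1                    # Step 5: increment; terminate at end of data
    key_points.append(len(t) - 1) # the last sample closes the final interval
    return key_points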


Figure 3: Sequential key point selection and interpolation calculation method: if the error of the candidate function $f_m(t)$ over $[t_s, t_e]$ is greater than the threshold, then $t_{e-1}$ becomes a key point and the interpolation function over $[t_s, t_{e-1}]$ is fixed.

6.6.1.4 Description Examples

In the following example, a position in 2D (x and y axes) is described by the TemporalInterpolation D. Since 4 key points are used in the example, 4 key point positions and 3 interpolation functions are described for each axis. For the y-axis only 1st order polynomials are used as interpolation functions, so the descriptions of FunctionID are omitted.

<!-- 2 dimensional temporal interpolation with 4 key points -->
<TemporalInterpolation NKeyPoints="4" Dimension="2">
  <!-- time of 4 key points -->
  <KeyPointT> 0.0 </KeyPointT>
  <KeyPointT> 2.0 </KeyPointT>
  <KeyPointT> 10.5 </KeyPointT>
  <KeyPointT> 15.0 </KeyPointT>
  <!-- x values of 4 key points -->
  <KeyPoint> 118.9 </KeyPoint>
  <KeyPoint> 102.1 </KeyPoint>
  <KeyPoint> 82.35 </KeyPoint>
  <KeyPoint> 85.5 </KeyPoint>
  <!-- 3 interpolation functions -->
  <FunctionID> 1 </FunctionID> <!-- 2nd order polynomial -->
  <FunctionParameterValue> 2.1 </FunctionParameterValue>
  <FunctionID> 0 </FunctionID> <!-- 1st order polynomial -->
  <FunctionID> 1 </FunctionID> <!-- 2nd order polynomial -->
  <FunctionParameterValue> 0.2 </FunctionParameterValue>
  <!-- y values of 4 key points -->
  <KeyPoint> 210.0 </KeyPoint>
  <KeyPoint> 220.8 </KeyPoint>
  <KeyPoint> 228.9 </KeyPoint>
  <KeyPoint> 215.1 </KeyPoint>
  <!-- Default functions (linear) -->
</TemporalInterpolation>

6.6.1.5 Description Use

The TemporalInterpolation D is used for the description of any kind of temporally changing value. In the MotionTrajectory D, it is used for describing the trajectory of the center of gravity of an object. In the SpatioTemporalLocator DS, it is used to represent temporally changing motion parameter values.

6.6.1.6 Description Coding

The following describes a binary coding scheme for the TemporalInterpolation D.


                                             No. of bits   Mnemonic
TemporalInterpolation {
  NKeyPoints                                 16            uimsbf
  Dimension                                  4             uimsbf
  ConstantTimeInterval                       1             bslbf
  if( ConstantTimeInterval ) {
    WholeInterval                            ?             TimeDS
  } else {
    for( i=0; i<NKeyPoints; i++ ) {
      KeyPointT                              ?             TimeDS
    }
  }
  for( i=0; i<Dimension; i++ ) {
    for( j=0; j<NKeyPoints; j++ ) {
      KeyPoint                               32            uimsbf
    }
    DefaultFunction                          1             bslbf
    if( !DefaultFunction ) {
      for( j=0; j<NKeyPoints-1; j++ ) {
        FunctionID                           2             uimsbf
        if( FunctionID > 0 ) {
          FunctionParameterValue             32            uimsbf
        }
      }
    }
  }
}

NKeyPoints: The number of key points is given by 16 unsigned bits: from 0 to 65535.
Dimension: The dimension of the interpolation function is given by 4 unsigned bits: from 0 to 15.
ConstantTimeInterval: This one bit is set to "1" if the time intervals between key points are constant, and to "0" otherwise.
WholeInterval: The binary representation format should be defined with the Time specification.
KeyPointT: The binary representation format should be defined with the Time specification.
KeyPoint: A float specifying the key point position.
DefaultFunction: This one bit is set to "1" if the default interpolation function (linear interpolation) is used in the current dimension, and to "0" otherwise.
FunctionID: These bits specify the FunctionID defining the type of interpolation function used, as defined above. The semantics of the bits is given below.

Function_ID   Meaning
00            0 (first order function)
01            1 (second order function)
11            -1 (no function)

FunctionParameterValue: A float specifying the interpolation function parameter.
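The following Python sketch (informative) serializes a description according to the table above. The 32-bit float fields are written here in IEEE 754 big-endian form, which is an assumption of this sketch, and the Time-based fields are omitted because their binary format is still to be defined.

import struct

FUNCTION_ID_CODE = {0: 0b00, 1: 0b01, -1: 0b11}  # per the table above

class BitWriter:
    def __init__(self):
        self.bits = []
    def write(self, value, nbits):
        self.bits += [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]
    def write_float(self, x):  # assumption: IEEE 754, big-endian
        for byte in struct.pack(">f", x):
            self.write(byte, 8)

def encode(bw, n_key_points, key_points, function_ids, params):
    """key_points[d][j] is the j-th key point of dimension d;
    function_ids[d] is None when the default (linear) case applies."""
    dimension = len(key_points)
    bw.write(n_key_points, 16)
    bw.write(dimension, 4)
    bw.write(0, 1)                        # ConstantTimeInterval = 0
    # (KeyPointT fields omitted: binary Time format not yet defined)
    for d in range(dimension):
        for j in range(n_key_points):
            bw.write_float(key_points[d][j])
        default = function_ids[d] is None
        bw.write(1 if default else 0, 1)  # DefaultFunction
        if not default:
            p = iter(params[d])
            for fid in function_ids[d]:   # NKeyPoints-1 entries
                bw.write(FUNCTION_ID_CODE[fid], 2)
                if fid > 0:
                    bw.write_float(next(p))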

6.7 Scalable Series

Scalable series are datatypes for series of values (scalars or vectors). They allow the series to be scaled (downsampled) in a well-defined fashion. Two types are available for use in descriptors: SeriesOfScalarType and SeriesOfVectorType. They are useful to build descriptors that contain time series of values.


6.7.1 SeriesOfScalarType

This descriptor represents a general series of scalars. Use this abstract type within descriptor definitions. Applications will instantiate the series as one of several concrete subtypes defined below.

6.7.1.1 Description Syntax

<!-- ******************************************* -->
<!-- Definition of abstract "SeriesOfScalarType" -->
<!-- ******************************************* -->
<complexType name="SeriesOfScalarType" abstract="true">
  <attribute name="nElements" type="nonNegativeInteger" use="default" value="1"/>
</complexType>

6.7.1.2 Description Semantics

SeriesOfScalarType: A representation of a series of scalar values of a feature.
nElements: The number of elements in the series. The default is one.

6.7.1.3 Description Use

SeriesOfScalarType is a datatype for series of scalars. Descriptions are instantiated as subtypes that allow the data to be stored scaled by a constant factor, stored scaled by a variable factor, or stored at full resolution.

The following is an example of the use of SeriesOfScalarType in a descriptor definition, AudioPowerType, which describes the time-averaged squared audio waveform.

<!-- ************************************* -->
<!-- Audio Signal Power -->
<!-- ************************************* -->
<complexType name="AudioPowerType" base="AudioSampledType" derivedBy="extension">
  <element name="Value" type="mds:SeriesOfScalarType" maxOccurs="unbounded"/>
</complexType>

AudioPowerType is defined as a subtype of AudioSampledType that specifies the sampling period of the descriptor when it was first calculated. The time series of descriptor samples is stored as a SeriesOfScalarType, which allows the temporal resolution to be further reduced. Multiple occurrences are allowed, so the data may be summarized at different resolutions (and with different scaling rules, see below).

6.7.1.4 Description Example

SeriesOfScalarType is abstract and is not instantiated. Description examples are given for the subtypes below.

6.7.2 SOSScaledType

This is an abstract type that holds those fields that are common to its two subtypes SOSFixedScaleType and SOSFreeScaleType. Several fields may be present, representing different scaling operations applied to the same data with the same scale ratio. The subtypes specify the scale ratio.

6.7.2.1 Description Syntax

<!-- *********************************************************** -->
<!-- Definition of abstract "SOSScaledType" -->
<!-- *********************************************************** -->
<complexType name="SOSScaledType" base="mds:SeriesOfScalarType" derivedBy="extension" abstract="true">
  <element name="Min" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Max" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Mean" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Random" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="First" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Last" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Variance" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Weight" type="mds:FloatVectorType" minOccurs="0"/>
</complexType>

6.7.2.2 Description Semantics

Semantics of the SOSScaledType D:

SOSScaledType: An abstract type for scaled series of scalars.
Mean: Series of means of groups of samples. Size of series must equal 'nElements'.
Min: Series of minima of groups of samples. Size of series must equal 'nElements'.
Max: Series of maxima of groups of samples. Size of series must equal 'nElements'.
Random: Downsampled series (one sample selected at random from each group of samples). Size of series must equal 'nElements'.
First: Downsampled series (first sample selected from each group of samples). Size of series must equal 'nElements'.
Last: Downsampled series (last sample selected from each group of samples). Size of series must equal 'nElements'.
Variance: Series of variances of groups of samples. Size of series must equal 'nElements'.
Weight: Optional series of weights. Contrary to the other fields, these do not represent values of the descriptor itself, but auxiliary weights used to control scaling (see below). Size of series must equal 'nElements'.

6.7.2.3 Definitions of Scaling Operations

Scaling is restricted to operations that are scalable in the following sense: if the series is scaled by a scale ratio of P, and then rescaled by a factor of Q, the result is the same as if the original series had been scaled by a scale ratio of N=PQ.

Mean: $\bar{x}_k = \frac{1}{N}\sum_{i=kN}^{(k+1)N-1} x_i$. If Weight is present: $\bar{x}_k = \left(\sum_i w_i x_i\right) / \sum_i w_i$; if all samples have zero weight, set to zero by convention.

Min: minimum of the $N$ samples of the group. If Weight is present: ignore samples with zero weight; if all have zero weight, set to zero by convention.

Max: maximum of the $N$ samples of the group. If Weight is present: ignore samples with zero weight; if all have zero weight, set to zero by convention.

Random: choose at random among the $N$ samples. Ignore Weight.

First: choose the first of the $N$ samples. Ignore Weight.

Last: choose the last of the $N$ samples. Ignore Weight.

Variance: $v_k = \frac{1}{N}\sum_{i=kN}^{(k+1)N-1} \left(x_i - \bar{x}_k\right)^2$. If Weight is present: $v_k = \left(\sum_i w_i (x_i - \bar{x}_k)^2\right) / \sum_i w_i$; if all samples have zero weight, set to zero by convention.

Weight: $\bar{w}_k = \frac{1}{N}\sum_{i=kN}^{(k+1)N-1} w_i$ (the mean of the weights of the group).


In these formulae, k is an index in the scaled series, and i an index in the original series. N is the number of samples summarized by each scaled sample. Depending on the subtype (SOSFixedScaleType or SOSFreeScaleType) this number is fixed for all k (and a power of two) or variable (and an arbitrary positive integer). The formula for Variance differs from the standard formula for unbiased variance by the presence of N rather than N-1. This is purely for simplicity of calculation: unbiased variance is easy to derive from it. If the 'weight' field is present, the terms of the sum are weighted.
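A non-normative Python sketch of these operations for one group of N samples (the weighted variants follow the conventions above; the function name and the returned dictionary layout are illustrative):

import random

def scale_group(samples, weights=None):
    """Scale one group of N samples to one element per field."""
    n = len(samples)
    if weights is None:
        weights = [1.0] * n
    wsum = sum(weights)
    if wsum == 0.0:
        mean = variance = 0.0           # all-zero weights: zero by convention
        kept = []
    else:
        mean = sum(w * x for w, x in zip(weights, samples)) / wsum
        variance = sum(w * (x - mean) ** 2
                       for w, x in zip(weights, samples)) / wsum
        kept = [x for w, x in zip(weights, samples) if w > 0]
    return {
        "Mean": mean,
        "Min": min(kept) if kept else 0.0,   # ignore zero-weight samples
        "Max": max(kept) if kept else 0.0,
        "Random": random.choice(samples),    # weights ignored
        "First": samples[0],                 # weights ignored
        "Last": samples[-1],                 # weights ignored
        "Variance": variance,                # N (not N-1) normalization
        "Weight": wsum / n,                  # mean weight, to allow rescaling
    }

Note that rescaling the Mean series (carrying the Weight field as the weights of the second pass) reproduces scaling by N = PQ directly; rescaling Variance additionally requires the Mean field, by the law of total variance.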

6.7.2.4 Description Example

SOSScaledType is abstract and is not instantiated. Description examples are given for the subtypes below.

6.7.3 SOSFixedScaleType

Use this type to instantiate a series of scalars with a uniform power-of-two scale ratio. The scale ratio is the number of samples of the original series represented by each sample of the scaled series; it defines the resolution of the scaled series. The entire series uses the same scale ratio, and this ratio is constrained to be a power of two.

6.7.3.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOSFixedScaleType" (uniform power-of-2 scaling) -->
<!-- *************************************************************** -->
<complexType name="SOSFixedScaleType" base="mds:SOSScaledType" derivedBy="extension">
  <element name="VarianceScalewise" type="mds:FloatMatrixType" minOccurs="0"/>
  <attribute name="RootFirst" type="boolean" use="default" value="false"/>
  <attribute name="ScaleRatio" type="positiveInteger" use="required"/>
  <attribute name="TotalSamples" type="positiveInteger" use="required"/>
</complexType>

6.7.3.2 Description Semantics

SOSFixedScaleType: A representation of a series of scalar values with reduced resolution.
ScaleRatio: Number of original samples represented by each sample of the scaled series. Must be a power of two.
VarianceScalewise: Optional array of arrays of scalewise variance coefficients. Scalewise variance is a decomposition of the variance into a series of coefficients, each of which describes the variability at a particular scale. There are log2(ScaleRatio) such coefficients. See the definition below. Number of rows must equal 'nElements', number of columns must equal the number of coefficients of the scalewise variance.
RootFirst: Optional flag. If true, the series are recoded in "root-first" format. This format is defined below. In brief: the recoded series starts with the grand mean (or min, or max, etc.) of the original series, and the subsequent values provide a progressively refined description from which the entire series can be reconstructed.
TotalSamples: Total number of samples in the original series before it was scaled.

6.7.3.2.1 RootFirst format

RootFirst format is defined only for 'SOSFixedScaleType' (uniform sampling with power-of-two ratio). RootFirst format is a way of rearranging the coefficients so that they represent the original series in a "coarse-first, fine-last" fashion. Based on the binary mean tree of 6.7.3.2.2 over $N = 2^m$ samples, with $\mu_{j,k}$ the mean at level $j$ and position $k$ ($\mu_{m,0}$ being the grand mean), the coefficients $y$ of the root-first series are calculated as:

$$y_0 = \mu_{m,0}, \qquad y_{2^{j-1}+k} = \mu_{m-j,\,2k+1} - \mu_{m-j,\,2k}, \quad j = 1, \ldots, m, \; k = 0, \ldots, 2^{j-1}-1.$$

The binary mean tree (and therefore the original series) can be reconstructed from this series:

$$\mu_{m-j,\,2k} = \mu_{m-j+1,\,k} - \tfrac{1}{2}\, y_{2^{j-1}+k}, \qquad \mu_{m-j,\,2k+1} = \mu_{m-j+1,\,k} + \tfrac{1}{2}\, y_{2^{j-1}+k}.$$

The first coefficient is the grand mean. The second is the difference between the means of the first and second half of the series, from which these two means can be calculated, etc. RootFirst format may be useful to transmit a description over a slow network, for example to display a progressively refined image of the descriptor. RootFirst format is defined only for the 'mean' field. If 'RootFirst' is true, only the 'mean' field is allowed.

Editor's Note: A RootFirst format may also be defined for 'min' and 'max' and possibly 'variance' (TBD). If such a definition is provided this restriction to 'mean' may be relaxed.
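A small Python sketch of the recoding (informative; the sign convention, second half minus first half, is inferred from the example in 6.7.3.3.1):

def root_first(means):
    """Recode a power-of-two Mean series into root-first order."""
    levels = [list(means)]
    while len(levels[-1]) > 1:           # build the binary mean tree
        prev = levels[-1]
        levels.append([(prev[2 * k] + prev[2 * k + 1]) / 2
                       for k in range(len(prev) // 2)])
    out = [levels[-1][0]]                # grand mean first
    for level in reversed(levels[:-1]):  # then coarse-to-fine differences
        out += [level[2 * k + 1] - level[2 * k]
                for k in range(len(level) // 2)]
    return out

# Mean series from the example in 6.7.3.3:
print(root_first([17.96, 49.50, 74.25, 68]))
# -> [52.4275, 37.395, 31.54, -6.25], matching the example in 6.7.3.3.1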

6.7.3.2.2 Scalewise Variance

Scalewise variance is defined only for 'SOSFixedScaleType' (uniform sampling with power-of-two ratio). Scalewise variance is a decomposition of the variance into a vector of coefficients that describe variability at different scales. The sum of these coefficients equals the variance. To calculate the scalewise variance of a group of $N = 2^m$ samples $x_0, \ldots, x_{N-1}$, first recursively form a binary tree of means:

$$\mu_{0,k} = x_k, \qquad \mu_{j+1,k} = \tfrac{1}{2}\left(\mu_{j,2k} + \mu_{j,2k+1}\right).$$

Then calculate the coefficients:

$$c_j = \frac{2^{j+1}}{N} \sum_{k=0}^{N/2^{j+1}-1} \left( \frac{\mu_{j,2k} - \mu_{j,2k+1}}{2} \right)^2, \qquad j = 0, \ldots, m-1.$$

The vector formed by these coefficients is the scalewise variance for this group of samples. The 'VarianceScalewise' field stores a series of such vectors.
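The decomposition can be checked with a few lines of Python (informative sketch; it emits the finest scale first, and the stored coefficient order is an assumption here):

def scalewise_variance(x):
    """Scalewise variance of a group of 2**m samples; the sum of the
    returned coefficients equals the population variance."""
    coeffs = []
    level = list(x)
    while len(level) > 1:
        pairs = [(level[2 * k], level[2 * k + 1])
                 for k in range(len(level) // 2)]
        # contribution of this scale: mean of squared half-differences
        coeffs.append(sum(((a - b) / 2) ** 2 for a, b in pairs) / len(pairs))
        level = [(a + b) / 2 for a, b in pairs]   # next level of the mean tree
    return coeffs

x = [0.0, 0.0, 2.0, 2.0]
mean = sum(x) / len(x)
assert abs(sum(scalewise_variance(x))
           - sum((v - mean) ** 2 for v in x) / len(x)) < 1e-12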

6.7.3.3 Description Examples

Consider the following series of 100 samples.

10 10 10 10 10 10 10 10 10 10
14 17 19 18 15 11 10 10 13 20
27 32 33 30 25 18 14 14 20 30
39 46 48 43 35 25 19 20 27 40
52 61 63 56 45 32 25 25 34 50
65 76 77 69 54 39 30 30 42 60
78 90 92 82 65 46 35 36 49 70
91 99 99 95 75 54 41 41 56 80
99 99 99 99 84 61 46 46 63 90
99 99 99 99 94 68 51 52 70 99

The descriptor definition specifies that the data should be stored in an element of type SeriesOfScalarType. To produce a description, one must choose a subtype for the element. We choose a fixed scale type:

<element name="MySeriesOfScalars" type="SOSFixedScaleType"/>

Given this declaration, the element may be instantiated from the original series in a variety of ways according to the needs of the application. In the following example the series is scaled to four elements. The first three are each averages of 32 samples of the original series, the last element is the average of the last 4 samples.

<MySeriesOfScalars NElements="4" ScaleRatio="32" TotalSamples="100">
  <Mean Size="4"> 17.96 49.50 74.25 68 </Mean>
</MySeriesOfScalars>


In the following example the same series is summarized by four minima and maxima (the first three over 32 samples, the last over 4 samples).

<MySeriesOfScalars NElements="4" ScaleRatio="32" TotalSamples="100">
  <Min Size="4"> 10 19 35 51 </Min>
  <Max Size="4"> 46 92 99 99 </Max>
</MySeriesOfScalars>

Suppose that, in addition to the above series, a series of "weights" indicates which values are reliable and which are not. Such might be the case of a fundamental frequency (pitch) extractor. Here the weights are 0 or 1; other extractors might produce fractional values:

0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 0 0 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1 1 0
0 0 0 1 1 1 1 1 1 0
0 0 0 1 1 1 1 1 0

Unreliable values should be deemphasized in mean or variance calculations, and ignored in min and max calculations. In the following example weight values were used in the scaling operations. The weights themselves are also stored, to allow further scaling. Note that values of Mean, Min and Max, while similar to unweighted values, are not the same.

<MySeriesOfScalars NElements="4" ScaleRatio="32" TotalSamples="100">
  <Min Size="4"> 11 19 35 51 </Min>
  <Max Size="4"> 46 92 95 70 </Max>
  <Mean Size="4"> 22.75 49.50 63.00 57.67 </Mean>
  <Variance Size="4"> 87.7 432.3 369.5 76.22 </Variance>
  <Weight Size="4"> 0.63 1.00 0.69 0.75 </Weight>
</MySeriesOfScalars>

6.7.3.3.1 Example of RootFirst format

This example summarizes the same series as previously. The only difference with the first of the previous examples is the presence of the RootFirst flag, and the fact that the elements are recoded in RootFirst format: the first element is the grand mean, the second element is the difference between the means of the first two and last two elements, etc.:

<MySeriesOfScalars NElements="4" RootFirst="true" ScaleRatio="32" TotalSamples="100">
  <Mean Size="4"> 52.4275 37.3950 31.5400 -6.2500 </Mean>
</MySeriesOfScalars>

6.7.3.3.2 Example of Scalewise Variance

This example summarizes the same series as previously by its grand mean together with the scalewise variance. The sum of the scalewise variance coefficients equals the total variance.

<MySeriesOfScalars NElements="1" ScaleRatio="128" TotalSamples="100">
  <Mean Size="1"> 52.4275 </Mean>
  <VarianceScalewise Size="1 7">
    48.07 297.1 261.8 29.56 179.1 82.62 68.11
  </VarianceScalewise>
</MySeriesOfScalars>

6.7.4 SOSFreeScaleType

This type describes a series of scalars with variable scaling and arbitrary integer values (not just powers of two) of the scale ratio. The scale ratio may vary within the series.

6.7.4.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOSFreeScaleType" (variable ratio scaling) -->
<!-- *************************************************************** -->
<complexType name="SOSFreeScaleType" base="mds:SOSScaledType" derivedBy="extension">
  <sequence minOccurs="1" maxOccurs="unbounded">
    <element name="scaleRatio" type="positiveInteger"/>
    <element name="nElements" type="positiveInteger"/>
  </sequence>
</complexType>

6.7.4.2 Description Semantics

Semantics of the SOSFreeScaleType D:

SOSFreeScaleType: A representation of a series of scalar values with reduced resolution. Elements occur in runs. Each run has its own scale ratio. The series may contain any number of runs.
scaleRatio: Scale ratio value common to all the elements in a run.
nElements: Number of elements in a run.

6.7.4.3 Description Examples

These examples use the same series as above. The descriptor specifies a 'SeriesOfScalarType', and the application chooses to instantiate it with an element declared as a SOSFreeScaleType:

<element name="MySeriesOfScalars" type="SOSFreeScaleType"/>

Given this declaration, the element may be instantiated from the original series in a variety of ways. In the following example the series is scaled to four elements, each the average of 25 samples:

<MySeriesOfScalars NElements="4" TotalSamples="100">
  <Mean Size="4"> 46.56 47.44 48.64 49.64 </Mean>
  <ScaleRatio> 25 </ScaleRatio>
  <NElements> 4 </NElements>
</MySeriesOfScalars>

In the following example the same series is scaled to seven elements. The first and last are means of 50 and 40 samples respectively. The other 5 are individual samples that describe a restricted portion of the series at full resolution:

<MySeriesOfScalars NElements="7" TotalSamples="100">
  <Mean Size="7"> 25.50 65 76 77 69 54 70.91 </Mean>
  <ScaleRatio> 50 </ScaleRatio>
  <NElements> 1 </NElements>
  <ScaleRatio> 1 </ScaleRatio>
  <NElements> 5 </NElements>
  <ScaleRatio> 40 </ScaleRatio>
  <NElements> 1 </NElements>
</MySeriesOfScalars>

6.7.5 SOSUnscaledType

This type describes a series of data at full resolution. Instructions may be provided to specify how the data should be scaled when the need arises.

6.7.5.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOSUnscaledType" -->
<!-- *************************************************************** -->
<complexType name="SOSUnscaledType" base="mds:SeriesOfScalarType" derivedBy="extension">
  <element name="Data" type="mds:FloatVectorType" minOccurs="1"/>
  <element name="Weight" type="mds:FloatVectorType" minOccurs="0"/>
  <!-- specify how this series should be scaled, should the need arise: -->
  <attribute name="MeanFlag" type="boolean" use="default" value="false"/>
  <attribute name="MinFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxFlag" type="boolean" use="default" value="false"/>
  <attribute name="RandomFlag" type="boolean" use="default" value="false"/>
  <attribute name="FirstFlag" type="boolean" use="default" value="false"/>
  <attribute name="LastFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceScalewiseFlag" type="boolean" use="default" value="false"/>
</complexType>

6.7.5.2 Description Semantics

Semantics of the SOSUnscaledType D:

SOSUnscaledType: A representation of a series of scalar values of a feature at full resolution (no scaling).
Data: Values in the series. Size of array must equal 'nElements'.
Weight: Optional array of weights. Weights serve to control scaling. Size of array must equal 'nElements'.

The following are optional instructions for scaling. If present, these instructions specify how the data should be scaled, should the need for scaling arise. The scaled data are then stored in a scaled type (SOSFreeScaleType or SOSFixedScaleType).

MeanFlag: If true, scale to a series of means.
MinFlag: If true, scale to a series of minima.
MaxFlag: If true, scale to a series of maxima.
RandomFlag: If true, scale by choosing at random in each group of samples.
FirstFlag: If true, scale by choosing the first of each group of samples.
LastFlag: If true, scale by choosing the last of each group of samples.
VarianceFlag: If true, scale to a series of variances.
VarianceScalewiseFlag: If true, scale to a series of scalewise variances.

6.7.5.3 Description Example

These examples use the same series as before. The descriptor specifies a 'SeriesOfScalarType', storing the series at full resolution. The flags indicate that the series should be scaled by taking its mean and variance, should the need for scaling arise:

<element name="MySeriesOfScalars" type="SOSUnscaledType"/>

<MySeriesOfScalars NElements="100" MeanFlag="true" VarianceFlag="true">
  <Data Size="100">
    10 10 10 10 10 10 10 10 10 10
    14 17 19 18 15 11 10 10 13 20
    27 32 33 30 25 18 14 14 20 30
    39 46 48 43 35 25 19 20 27 40
    52 61 63 56 45 32 25 25 34 50
    65 76 77 69 54 39 30 30 42 60
    78 90 92 82 65 46 35 36 49 70
    91 99 99 95 75 54 41 41 56 80
    99 99 99 99 84 61 46 46 63 90
    99 99 99 99 94 68 51 52 70 99
  </Data>
</MySeriesOfScalars>

6.7.6 SeriesOfVectorType

This descriptor represents an abstract type for a general series of vectors. Applications will instantiate the series as one of several subtypes defined below.

6.7.6.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of abstract "SeriesOfVectorType" -->
<!-- *************************************************************** -->


<complexType name="SeriesOfVectorType" abstract="true">
  <attribute name="nElements" type="nonNegativeInteger" use="default" value="1"/>
  <attribute name="VectorSize" type="positiveInteger" use="default" value="1"/>
</complexType>

6.7.6.2 Description Semantics

Semantics of the SeriesOfVectorType D:

SeriesOfVectorType: A representation of a series of vector values of a feature.
nElements: Number of elements in the series. The default is one.
VectorSize: Size of the vectors. Must be equal to the number of columns of the matrix types used to store the series.

6.7.6.3 Description Use

SeriesOfVectorType is a datatype for series of vectors. Descriptions are instantiated as subtypes that allow the data to be stored scaled by a constant factor, scaled by a variable factor, or at full resolution. SeriesOfVectorType automatically supports a range of scaling operations and statistics.

This is an example of the use of SeriesOfVectorType in a descriptor definition. AudioSpectrumEnvelopeType describes the time-averaged log-band power spectrum.

<!-- *************************************************************** -->
<!-- Spectrum -->
<!-- *************************************************************** -->
<complexType name="AudioSpectrumEnvelopeType" base="AudioSampledType" derivedBy="extension">
  <element name="Value" type="mds:SeriesOfVectorType"/>
  <!-- other information needed for spectra -->
</complexType>

The descriptor is defined as a subtype of AudioSampledType that specifies the original sampling period of the full-resolution descriptor. The time series of descriptor samples is stored as a SeriesOfVectorType. Multiple series are allowed, so the data may be represented at different resolutions (possibly with different scaling rules).

6.7.6.4 Description Example

SeriesOfVectorType is abstract and is not instantiated. Description examples are given for the subtypes below.

6.7.7 SOVScaledType

This is an abstract type that holds those fields that are common to its two subtypes. Note that all fields are optional. They are instantiated at the discretion of the application software.

6.7.7.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of abstract "SOVScaledType" -->
<!-- *************************************************************** -->
<complexType name="SOVScaledType" base="mds:SeriesOfVectorType" derivedBy="extension" abstract="true">
  <element name="Min" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Max" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Mean" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Random" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="First" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Last" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Variance" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="Covariance" type="mds:FloatMatrixType" minOccurs="0"/>
  <element name="VarianceSummed" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="MaxSqDist" type="mds:FloatVectorType" minOccurs="0"/>
  <element name="Weight" type="mds:FloatVectorType" minOccurs="0"/>
</complexType>

6.7.7.2 Description Semantics

Semantics of the SOVScaledType D:

SOVScaledType: An abstract type for scaled series of vectors.
Mean: Series of means of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Min: Series of minima of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Max: Series of maxima of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Random: Downsampled series (one sample selected at random from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
First: Downsampled series (first sample selected from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Last: Downsampled series (last sample selected from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Variance: Series of variance vectors of groups of vector samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Covariance: Series of covariance matrices of groups of vector samples. This is a three-dimensional matrix. Number of rows must equal 'nElements', number of columns and number of pages must both equal 'VectorSize'.
VarianceSummed: Series of summed variance coefficients of groups of samples. Size of array must equal 'nElements'.
MaxSqDist: Series of coefficients representing an upper bound of the distance between groups of samples and their mean. Size of array must equal 'nElements'.
Weight: Optional series of weights. Weights control downsampling of the other fields (see the explanation for series of scalars). Size of array must equal 'nElements'.

6.7.7.3 Definition of Scaling Operations

Most of the operations are straightforward extensions of the operations previously defined for series of scalars, applied uniformly to each dimension of the vectors. Operations that are specific to vectors are defined here.

VarianceSummed: $v_k = \sum_{d=1}^{D} \frac{1}{N} \sum_i \left( x_{i,d} - \bar{x}_{k,d} \right)^2$, i.e. the variance summed over the $D$ vector coefficients. If Weight is present, the terms of the inner sum are weighted and normalized by the sum of the weights; if all samples have zero weight, set to zero by convention.

MaxSqDist: $\max_i \lVert x_i - \bar{x}_k \rVert^2$, the maximum squared Euclidean distance between the samples of the group and their mean. If Weight is present, ignore samples with zero weight; if all samples have zero weight, set to zero by convention.

Covariance is calculated according to its standard definition with (N-1) replaced by N. The various variance/covariance options offer a choice of several cost/performance tradeoffs.
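An informative Python sketch of the two vector-specific operations for one group of samples (each sample a list of D coefficients; unweighted case, function names illustrative):

def variance_summed(group):
    """Variance summed over vector coefficients (N, not N-1, normalization)."""
    n, dim = len(group), len(group[0])
    mean = [sum(v[d] for v in group) / n for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in group for d in range(dim)) / n

def max_sq_dist(group):
    """Upper bound: largest squared Euclidean distance from the group mean."""
    n, dim = len(group), len(group[0])
    mean = [sum(v[d] for v in group) / n for d in range(dim)]
    return max(sum((v[d] - mean[d]) ** 2 for d in range(dim)) for v in group)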

6.7.7.4 Description Example

SOVScaledType is abstract and is not instantiated. Description examples are given for the subtypes below.

6.7.8 SOVFixedScaleType

This descriptor describes a series of vectors with a uniform, power-of-two scale ratio. The scale ratio is the number of samples of the original series represented by each sample of the scaled series; it defines the resolution of the scaled series. The entire series uses the same scale ratio, and this ratio is constrained to be a power of two.


6.7.8.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOVFixedScaleType" (uniform power-of-2 scaling) -->
<!-- *************************************************************** -->
<complexType name="SOVFixedScaleType" base="mds:SOVScaledType" derivedBy="extension">
  <element name="VarianceScalewise" type="mds:FloatMatrixType" minOccurs="0"/>
  <attribute name="RootFirst" type="boolean" use="default" value="false"/>
  <attribute name="ScaleRatio" type="positiveInteger" use="required"/>
  <attribute name="TotalSamples" type="positiveInteger" use="required"/>
</complexType>

6.7.8.2 Description Semantics

SOVFixedScaleType: A representation of a reduced-resolution series of vector samples.
ScaleRatio: Number of original samples represented by each sample of the scaled series. Must be a power of two.
VarianceScalewise: Array of arrays of scalewise summed-variance coefficients. Scalewise variance is a decomposition of the variance into a series of coefficients, each of which describes the variability at a particular scale. Number of rows must equal 'nElements', number of columns must equal the number of coefficients of the scalewise variance.
RootFirst: If true, the series are recoded in "root-first" format. See the explanation for SOSFixedScaleType.
TotalSamples: Total number of samples in the original series before it was scaled. Each sample is a vector.

6.7.8.3 Description Examples

The following examples assume that the descriptor extractor has produced a series of 128 vectors of 10 coefficients (not shown). The descriptor definition specifies that the data should be stored in an element of type SeriesOfVectorType. To produce a description, one must choose a subtype for the element. We choose a fixed scale type:

<element name="MySeriesOfVectors" type="SOVFixedScaleType"/>

Given this declaration, the element may be instantiated from the original series in a variety of ways according to the needs of the application. In the following example the series is scaled to four elements that each averages 32 samples of the original series:

<MySeriesOfVectors NElements="4" VectorSize="10" ScaleRatio="32" TotalSamples="128">
  <Mean Size="4 10">
    0.34 0.48 0.59 0.69 0.77 0.84 0.91 0.97 1    1.1
    0.61 0.87 1.1  1.2  1.4  1.5  1.6  1.7  1.8  1.9
    0.79 1.1  1.4  1.6  1.8  1.9  2.1  2.2  2.4  2.5
    0.94 1.3  1.6  1.9  2.1  2.3  2.5  2.7  2.8  3
  </Mean>
</MySeriesOfVectors>

In this example the same series of vectors is represented by its grand mean, minimum, maximum and variance.

<MySeriesOfVectors VectorSize="10" ScaleRatio="128" TotalSamples="128">
  <Mean Size="1 10"> 0.67 0.95 1.2 1.3 1.5 1.6 1.8 1.9 2 2.1 </Mean>
  <Min Size="1 10"> 0.1 0.14 0.17 0.2 0.22 0.2 0.26 0.2 0.3 0.32 </Min>
  <Max Size="1 10"> 1 1.4 1.7 2 2.2 2.4 2.6 2.8 3 3.2 </Max>
  <Variance Size="1 10"> 0.05 0.11 0.16 0.2 0.27 0.3 0.3 0.4 0.4 0.54 </Variance>
</MySeriesOfVectors>


In this example the same series of vectors is represented by its covariance matrix.

<MySeriesOfVectors VectorSize="10" ScaleRatio="128" TotalSamples="128">
  <Covariance Size="10 10">
    0.054 0.077 0.094 0.11 0.12 0.13 0.14 0.15 0.16 0.17
    0.077 0.11  0.13  0.15 0.17 0.19 0.2  0.22 0.23 0.24
    0.094 0.13  0.16  0.19 0.21 0.23 0.25 0.27 0.28 0.3
    0.11  0.15  0.19  0.22 0.24 0.27 0.29 0.31 0.32 0.34
    0.12  0.17  0.21  0.24 0.27 0.3  0.32 0.34 0.36 0.38
    0.13  0.19  0.23  0.27 0.3  0.32 0.35 0.38 0.4  0.42
    0.14  0.2   0.25  0.29 0.32 0.35 0.38 0.41 0.43 0.45
    0.15  0.22  0.27  0.31 0.34 0.38 0.41 0.43 0.46 0.48
    0.16  0.23  0.28  0.32 0.36 0.4  0.43 0.46 0.49 0.51
    0.17  0.24  0.3   0.34 0.38 0.42 0.45 0.48 0.51 0.54
  </Covariance>
</MySeriesOfVectors>

In this example the same series of vectors is represented by a series of 4 VarianceSummed coefficients, each of which is the variance (summed over vector coefficients) calculated over a group of 32 samples.

<MySeriesOfVectors NElements="4" VectorSize="10" ScaleRatio="32" TotalSamples="128">
  <VarianceSummed Size="4"> 0.70 0.19 0.11 0.082 </VarianceSummed>
</MySeriesOfVectors>

6.7.9 SOVFreeScaleType

This descriptor describes a series of vectors with variable scaling and an arbitrary (not just power-of-two) scale ratio.

6.7.9.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOVFreeScaleType" (variable arbitrary ratio scaling) -->
<!-- *************************************************************** -->
<complexType name="SOVFreeScaleType" base="mds:SOVScaledType" derivedBy="extension">
  <sequence minOccurs="1" maxOccurs="unbounded">
    <element name="scaleRatio" type="positiveInteger"/>
    <element name="nElements" type="positiveInteger"/>
  </sequence>
</complexType>

6.7.9.2 Description Semantics

SOVFreeScaleType: A representation of a reduced-resolution series of vector samples. Elements occur in runs. Each run has its own scale ratio. The series may contain any number of runs.
scaleRatio: Scale ratio value common to all the elements in a run.
nElements: Number of elements in a run.

6.7.10 SOVUnscaledType

Use this type to instantiate a series of vector data that are not (yet) scaled. Instructions may be provided to specify how the data should be scaled when the need arises.

6.7.10.1 Description Syntax

<!-- *************************************************************** -->
<!-- Definition of "SOVUnscaledType" -->
<!-- *************************************************************** -->
<complexType name="SOVUnscaledType" base="mds:SeriesOfVectorType" derivedBy="extension">
  <element name="Data" type="mds:FloatMatrixType" minOccurs="1"/>
  <element name="Weight" type="mds:FloatVectorType" minOccurs="0"/>
  <!-- specify how this series should be scaled, should the need arise: -->
  <attribute name="MeanFlag" type="boolean" use="default" value="false"/>
  <attribute name="MinFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxFlag" type="boolean" use="default" value="false"/>
  <attribute name="UniformFlag" type="boolean" use="default" value="false"/>
  <attribute name="RandomFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="CovarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceSummedFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceScalewiseFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxSqDistFlag" type="boolean" use="default" value="false"/>
</complexType>

6.7.10.2 Description Semantics

Semantics of the SOVUnscaledType D:

SOVUnscaledType: A representation of a series of vector values of a feature at full resolution (no scaling).
Data: Values. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Weight: Optional array of weights. Weights serve to control scaling. Size of array must equal 'nElements'.

The following are optional instructions for scaling. If present, these instructions specify how the data should be scaled, should the need for scaling arise. The scaled data are then stored in a scaled type (SOVFreeScaleType or SOVFixedScaleType).

MeanFlag: If true, scale to a series of means.
MinFlag: If true, scale to a series of minima.
MaxFlag: If true, scale to a series of maxima.
RandomFlag: If true, scale by choosing at random in each group of samples.
FirstFlag: If true, scale by choosing the first of each group of samples.
LastFlag: If true, scale by choosing the last of each group of samples.
VarianceFlag: If true, scale to a series of variance vectors.
CovarianceFlag: If true, scale to a series of covariance matrices.
VarianceSummedFlag: If true, scale to a series of summed variance coefficients.
VarianceScalewiseFlag: If true, scale to a series of vectors of scalewise summed-variance coefficients.
MaxSqDistFlag: If true, scale to a series of maximum squared distance coefficients.

6.7.11 Additional Informative Text

6.7.11.1 Examples of Applications

6.7.11.1.1 Example 1

The application is a "music-lover's workbench", with software to produce, organize and edit audio data files. The volume of data (number and size of files) is enormous, but each file is labeled with a compact ".m7" file containing low level signal descriptors: waveform envelope, spectrum envelope, fundamental frequency and periodicity measures, modulation spectrum, etc. These allow any file (or group of files) to be displayed as a waveform or a spectrogram, and possibly also auralized. They also allow classification, search and comparison between files. Crossing the descriptors


with metadata such as "genre" allows disambiguation, etc. All of these low-level descriptors are based on ScalableSeries. The ScalableSeries support the necessary downsampling, and their statistics fields support the search and comparison functionalities.

6.7.11.1.2 Example 2

The application is remote monitoring of parameters (weather, seismic, plant process, etc.). Requirements are real-time network transfer of a low bandwidth representation for display, storage of a high-resolution buffer for detailed investigation on demand, and robustness with respect to network failures. Data from the sensors are recorded at full resolution and stored in a circular buffer, and simultaneously a low-resolution scaled version is sent over the network (as an MPEG-7 stream). At regular intervals the circular buffer is transferred to disk (as an MPEG-7 file). To avoid running out of disk space, the full-resolution files are regularly processed to obtain scaled files (also in MPEG-7 format). The oldest scaled files are themselves rescaled according to a schedule that ensures that disk space never runs out. This record contains a complete history of the parameter, with temporal resolution that is reduced for older data. A set of statistics such as extrema (min, max), mean, variance, etc. gives useful information about the data that were discarded in the scaling process. The ScalableSeries support (a) the low-resolution representation for streaming over a low-bandwidth network, (b) the storage at full or reduced resolution, (c) the arbitrary scaling and rescaling to fit within a finite storage volume, and (d) statistics to characterize aspects of the data that may have been lost due to scaling: extrema, mean, variability. An "MPEG-7 recorder" discovered by archaeologists a thousand years hence would contain a useful history of that entire time span.

6.7.11.2 Examples of search algorithms based on scalable series

These examples illustrate how scalable series may be used for efficient search. The purpose is not to define algorithms rigorously, but rather to give ideas on how search applications might be addressed. ScalableSeries allow storage of useful statistics to support search, comparison and clustering. The reduced-resolution data serve either as a surrogate of the original data if it is unavailable, or as a short-cut to the original data if it is available but expensive to access. They make search possible in the first case, and fast in the second.

As an example, consider a large database of audio. The task is to determine whether a given snippet of audio belongs to the database, based on spectral similarity. A straightforward procedure is to scan the database, calculate spectra for each frame and compare them to the spectra of the snippet. For an even moderately sized database this is likely to be too expensive for two reasons: the high cost of data access and spectrum calculation, and the inefficiency of linear search. To reduce the former cost, the spectra may be precomputed, but this requires large amounts of storage to keep the spectra. It also does not address the second cost, that of linear search through the spectra. ScalableSeries address both costs: they allow storage in a cheap low-resolution format that supports efficient search. Typically a feature may be extracted as a series of 'N' samples. This is scaled to 'm' elements that each summarizes a group of 'n' samples (N = mn). Search occurs in two steps: first among the 'm' elements for the 'q' best candidates, and then among the 'n' samples of each candidate. The N-step search is thus replaced by an (m + qn)-step search. This is fast if q << m, that is, if groups are labeled with information that allows the algorithm to decide whether they are good candidates. That is the purpose of the statistics afforded by ScalableSeries.
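A schematic Python sketch of this two-step search over a scalar feature (informative; ranking the candidate groups by distance to their stored means is one possible labeling strategy):

def two_step_search(query, group_means, groups, q):
    """Coarse-to-fine search: rank the m groups by |group mean - query|,
    keep the q best, then scan only those groups' n samples."""
    candidates = sorted(range(len(group_means)),
                        key=lambda k: abs(group_means[k] - query))[:q]
    best = None
    for k in candidates:
        for i, sample in enumerate(groups[k]):
            d = abs(sample - query)
            if best is None or d < best[0]:
                best = (d, k, i)       # distance, group index, sample index
    return best                        # roughly m + q*n distance evaluations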

It is useful to think of the scaled series as being one layer of a tree. The leaves are the samples of the original full-resolution series, the root is a single summary value obtained by scaling the series by a scale ratio equal to its number of samples. To be specific we can suppose that it is a binary tree (n-ary trees may be more effective in practice). Starting from the series of 'N' samples, a (2N-1)-node balanced binary tree is formed by repeatedly applying the scaling operation. Search then proceeds from the root, with a cost that may be as small as O(log2(N)). In practice, the search algorithms may use only part of this tree, but to keep things simple they are described as using the full tree.

6.7.11.2.1 Search and comparison using Min and Max
The 'SOSMin' and 'SOSMax' fields of a SeriesOfScalars store series of 'min' and 'max' values of groups of samples. The 'SOVMin' and 'SOVMax' fields of a SeriesOfVectors store similar information for groups of vectors in a series.

Suppose that a binary tree has been constructed based on the samples of a series of scalars, each node containing a min-max pair summarizing the group of samples that it spans. The task is to determine whether a sample 'x' appears within the series.
Algorithm: Starting from the root, at each node test whether 'x' is between 'min' and 'max'. If it is, recurse on both child nodes. If it is not, prune the subtree (ignore all child nodes).
Search is fast if a large proportion of the tree is pruned. In the case that a single layer of 'm' nodes is available, the test is performed on each, and those for which it fails are discarded. The search then proceeds by accessing (or recalculating) the data spanned by the successful nodes. Search is fast if these are few.
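The following informative sketch (Python) implements the tree construction and the pruning search just described. It is not part of the XM; the class and function names are invented for the example, and the series length is assumed to be a power of two.

class MinMaxNode:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi = lo, hi          # min and max of the samples this node spans
        self.left, self.right = left, right

def build(samples):
    """Build a balanced min-max tree; len(samples) is assumed a power of two."""
    nodes = [MinMaxNode(x, x) for x in samples]              # leaves
    while len(nodes) > 1:
        nodes = [MinMaxNode(min(a.lo, b.lo), max(a.hi, b.hi), a, b)
                 for a, b in zip(nodes[0::2], nodes[1::2])]  # one scaling step
    return nodes[0]                                          # root

def contains(node, x):
    """True if sample x appears in the span; prunes subtrees whose
    [min, max] interval excludes x."""
    if not (node.lo <= x <= node.hi):
        return False                       # prune this subtree
    if node.left is None:                  # leaf: exact match
        return True
    return contains(node.left, x) or contains(node.right, x)

tree = build([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
print(contains(tree, 5.0), contains(tree, 7.0))   # True False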


Suppose that min-max binary trees have been constructed for two series (segments) of scalars 'A' and 'B'. The task is to determine whether segment A is included within segment B. The algorithm is based on the fact that the min and max of an interval are intermediate between the min and max of any interval that includes it.
Algorithm: In a first step, tree B is processed by merging adjacent nodes of each layer. For layer j, node k is merged with node k+1:

\[ \widetilde{m}_j[k] = \min\big(m_j[k],\, m_j[k+1]\big), \qquad \widetilde{M}_j[k] = \max\big(M_j[k],\, M_j[k+1]\big) \]

This step is necessary because the intervals of groups of samples subsumed by nodes of trees A and B might not be aligned. The second step is to compare the min-max pair of the root of A to nodes of B, starting at the root. This process stops at the depth where each node of B spans an interval just large enough to contain the interval spanned by A. Nodes for which this test fails are pruned. The third step consists of comparing A and B layer by layer. For each layer of A, take the first node of A and compare it to successive nodes of the corresponding layer of B, pruning those for which the test fails. If the test succeeds for a certain node of B, compare the next nodes of A to the next nodes of B. If this test succeeds, the comparison may then proceed to the next layer. If it fails, and if all nodes of B have been tested, the algorithm declares that A is not included in B.

Search is fast if the trees can be pruned rapidly. It is much faster than other approaches such as sample-to-sample comparison or cross-correlation. The algorithm may be made yet faster (but more complex) by checking that candidate nodes of B are included in A. It may be extended to test whether files A and B have a common intersection. The algorithm may be usefully applied to find duplicates among files, or in an editing application to identify common waveform segments.

6.7.11.2.2 Search and comparison using Mean and MaxSqDist
Consider a series of vectors stored using SeriesOfVectorType. The 'SOVMean' and 'SOVMaxSqDist' fields represent, for each group of samples, their mean and an upper bound on the squared distance between this mean and each sample. This defines a hypersphere that is guaranteed to contain all samples of the group. The algorithms defined for Min and Max can be extended straightforwardly to this case.

6.7.11.2.3 Search and comparison using Mean and Variance
As a rough approximation, the distribution of scalar samples within a group may be modeled by a Gaussian distribution characterized by its mean and variance. For multidimensional data the covariance matrix is used in place of the variance, but it may be approximated by its diagonal terms (a vector of per-dimension variance terms), which can themselves be summarized by their sum. The covariance matrix characterizes an ellipsoidal approximation to the distribution, the variance vector characterizes the same with axes aligned to the dimensions, and their sum characterizes a spherical approximation. Whatever the quality of the approximation, it may be used to infer the probability that a particular search token belongs to the distribution. This allows effective search.
As for search based on extrema (min-max, or maximum squared distance from the mean), efficiency results from pruning the search tree (or ordering it so as to start the search in the most likely place). Pruning was deterministic for extrema; it is probabilistic in the case of mean and variance. For example, the search may decide to prune nodes that are distant by more than two or three standard deviations from the search token.
Alternatively, the mean and variance/covariance may be used to calculate the Mahalanobis distance:

\[ d_M(x) = \sqrt{(x - \mu)^{\mathsf{T}}\, \Sigma^{-1}\, (x - \mu)} \]

where \mu is the group mean and \Sigma is the covariance matrix (or its diagonal approximation). This distance may be used as a metric to compare scaled series. The variance itself may be used as a feature to support comparisons.
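An informative sketch of probabilistic pruning follows (Python; not part of the XM). It uses the diagonal covariance approximation described above and prunes groups whose mean is more than three standard deviations, in the Mahalanobis sense, from the search token. The data structures are invented for the example.

import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance using only the diagonal of the covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))

def candidate_groups(groups, token, threshold=3.0):
    """Keep groups within `threshold` standard deviations of the token;
    prune the rest."""
    return [g for g in groups if mahalanobis_diag(token, g["mean"], g["var"]) <= threshold]

groups = [
    {"mean": (0.0, 0.0), "var": (1.0, 1.0)},
    {"mean": (10.0, 10.0), "var": (0.5, 0.5)},
]
print(candidate_groups(groups, (0.5, -0.5)))   # only the first group survives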

6.7.11.2.4 Search and comparison using Scalewise Variance
Search using mean and variance may be efficient if the distance between groups is at least as large as the intra-group standard deviation. Typically, when search proceeds within a tree starting from the root, the variance of nodes within the first layers is likely to be large compared to the between-node variance. It is only beyond a certain depth that the ratio of inter-node to intra-node variance becomes favorable. If this depth is large (the layer is close to the leaves), search will be expensive because there are many nodes in that layer. If it is small, search is cheap because large portions of the tree may be pruned quickly. Scalewise variance is useful because it indicates the distribution of variance across scales, and thus allows nodes that are expensive to search to be pruned or given low priority. Scalewise variance may also be used as a feature (it is similar to a logarithmic octave-band power spectrum) to support comparison between scaled series.

6.7.11.3 Algorithms for variable ratio scaling
SOSFreeScaleType and SOVFreeScaleType allow a series to be scaled with a scale factor that varies within the series. This allows a rational usage of descriptor storage space (by avoiding long series of identical samples for the parts of a series that are constant). The scaling ratio can be set according to any desired criterion (ScalableSeries are agnostic as to how the parameter was set). The following procedures and criteria may be useful.


6.7.11.3.1 Variance-dependent sampling
The procedure is to sample the feature at intervals of a size inversely proportional to the variance, so that the sum of squared distances between the samples of each group and their mean is constant. This guarantees that the sampling resolution is greater for portions that are highly variable. For audio data, silent portions would be stored with lower resolution than high-amplitude portions.
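A minimal informative sketch of such a grouping rule follows (Python; not part of the XM, the function name and the `budget` parameter are invented for the example). Each group is extended greedily until its sum of squared deviations exceeds a fixed budget, so quiet runs collapse into long groups and variable runs stay at high resolution.

def variance_dependent_groups(x, budget):
    """Greedily extend each group until its sum of squared deviations
    from the group mean exceeds `budget`, then start a new group."""
    groups, start = [], 0
    for end in range(1, len(x) + 1):
        g = x[start:end]
        mean = sum(g) / len(g)
        if sum((v - mean) ** 2 for v in g) > budget and len(g) > 1:
            groups.append(x[start:end - 1])
            start = end - 1
    groups.append(x[start:])
    return groups

series = [0.0] * 8 + [0.0, 5.0, -5.0, 5.0]          # quiet run, then bursty run
print([len(g) for g in variance_dependent_groups(series, budget=10.0)])  # [9, 1, 1, 1]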

6.7.11.3.2 Constant variance criterion
The drawback of the previous scheme is that noisy portions of the series are likely to be overrepresented. An option is to merge adjacent samples that have the same variance, so that the highest resolution is available to represent changes in variance.

6.7.11.3.3 Ratio of between-group and within-group variance
Search among clusters of samples is most effective if the between-cluster variance is at least as large as the within-cluster variance. A useful procedure is thus to cut up the series so as to obtain a criterion ratio between inter-cluster and intra-cluster variance. The criterion can be estimated globally over the entire series, or locally (for example, by adjusting the boundary between two adjacent groups to obtain a criterion ratio between the standard deviation and the distance between means).

6.7.11.4 Rescaling
All scaling operations have the property that they can be performed in any number of intermediate stages. The data may be scaled by a factor of N = mn in one step, or first by m and then again by n. This section gives some indications on rescaling.

6.7.11.4.1 Mean
Rescaling is performed by averaging adjacent means and updating 'ScaleRatio'. For example, in the case of rescaling by a factor of two:

\[ \mu'[k] = \tfrac{1}{2}\big(\mu[2k] + \mu[2k+1]\big), \qquad \mathrm{ScaleRatio}' = 2\,\mathrm{ScaleRatio} \]

If the scale ratio is variable, rescaling is performed by taking the average of adjacent means weighted by the numbers of samples they subsume. For example:

\[ \mu'[k] = \frac{n[2k]\,\mu[2k] + n[2k+1]\,\mu[2k+1]}{n[2k] + n[2k+1]}, \qquad n'[k] = n[2k] + n[2k+1] \]

where n[k] is the number of original samples subsumed by scaled sample k.

If a weight field is present (e.g. 'SOSWeight'), the operations are weighted by this factor also.
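An informative sketch of the variable-ratio case (Python; not part of the XM) using the sample-count-weighted average above:

def rescale_means(means, counts):
    """Merge pairs of adjacent means, weighting each by the number of
    original samples it subsumes. Assumes an even number of elements."""
    out_means, out_counts = [], []
    for k in range(0, len(means), 2):
        n = counts[k] + counts[k + 1]
        out_means.append((counts[k] * means[k] + counts[k + 1] * means[k + 1]) / n)
        out_counts.append(n)
    return out_means, out_counts

# A mean of 2.0 over 3 samples merged with a mean of 5.0 over 1 sample:
print(rescale_means([2.0, 5.0], [3, 1]))   # ([2.75], [4])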

6.7.11.4.2 Min, max
Rescaling is performed by taking the min (resp. max) of adjacent samples. For example, in the case of rescaling by a factor of two:

\[ m'[k] = \min\big(m[2k],\, m[2k+1]\big), \qquad M'[k] = \max\big(M[2k],\, M[2k+1]\big) \]

If a weight field is present (e.g. 'SOSWeight'), samples with zero weight are ignored in the min (resp. max) operation. If all samples involved have zero weight, the result also has zero weight (and is set to zero by convention).

6.7.11.4.3 Variance
Rescaling is performed by taking the average of the variances of adjacent samples, and adding the biased variance of the corresponding means. For example, in the case of downsampling by a factor of two:

\[ v'[k] = \tfrac{1}{2}\big(v[2k] + v[2k+1]\big) + \tfrac{1}{2}\Big[\big(\mu[2k]-\mu'[k]\big)^2 + \big(\mu[2k+1]-\mu'[k]\big)^2\Big] \]

If scale ratios or a weight field are present, the appropriate weights are used within these calculations. Scaling of the variance requires the presence of the mean (if a SeriesOfScalars contains a 'Variance' field it must also contain a 'Mean' field).
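The following informative sketch (Python; not part of the XM) applies the variance rescaling formula above to two equal-sized groups, and checks that the result matches the biased variance computed directly on the pooled samples:

def biased_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def rescale_var(m1, v1, m2, v2):
    """Combine two equal-sized groups: average the variances and add the
    biased variance of the two means."""
    m = 0.5 * (m1 + m2)
    return m, 0.5 * (v1 + v2) + 0.5 * ((m1 - m) ** 2 + (m2 - m) ** 2)

a, b = [1.0, 3.0], [6.0, 10.0]
m, v = rescale_var(sum(a) / 2, biased_var(a), sum(b) / 2, biased_var(b))
print(v, biased_var(a + b))   # both 11.5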

6.7.11.4.4 Scalewise variance
Scaling involves two operations. First, adjacent scalewise variance coefficients are averaged:

\[ c'[k][j] = \tfrac{1}{2}\big(c[2k][j] + c[2k+1][j]\big) \]

Then, a new coefficient is derived from the means of layer m:

\[ c'[k][\,j_{\max}+1\,] = \tfrac{1}{2}\Big[\big(\mu[2k]-\mu'[k]\big)^2 + \big(\mu[2k+1]-\mu'[k]\big)^2\Big] \]

This coefficient is appended to the averaged coefficients; it describes the variability introduced at the new, coarser scale.


Scaling of scalewise variance requires the presence of the mean (if a SeriesOfScalars contains a 'VarianceScalewise' field it must also contain a 'Mean' field).

6.7.11.4.5 Maximum distance from mean

Scaling is performed by taking the max of adjacent samples of MSD (layer m-1), and adding to this the largest distance between the adjacent samples of the mean (layer m-1) and the new mean (layer m). For example, when downsampling by a factor of two (the bound follows from the triangle inequality):

\[ d'[k] = \Big(\sqrt{\max\big(d[2k],\, d[2k+1]\big)} \;+\; \max_{i \in \{2k,\,2k+1\}}\big|\mu[i] - \mu'[k]\big|\Big)^2 \]

Scalable series are datatypes for series of values (scalars or vectors). They allow the series to be scaled (downsampled) in a well-defined fashion. Two types are available for use in descriptors: SeriesOfScalarType and SeriesOfVectorType. They are useful in particular to build descriptors that contain time series of values.

Scalable series are designed to serve as building blocks for descriptors that may occur in series, such as low-level signal descriptors, histograms, statistics, etc. At full resolution such descriptors are likely to be expensive, so they may need to be scaled to a reduced resolution. Yet it is very difficult for the descriptor designer to decide the best scaling factor. A tradeoff must be found between cost and completeness, and the optimal tradeoff depends on factors that are known only to the application designer. In some cases the factors may be known only at runtime, and might even change dynamically. Scaled data may need to be rescaled to reduce storage or transmission costs.

Scalable series are designed for ease of use by descriptor designers. Just specify the datatypes 'SeriesOfScalarType' or 'SeriesOfVectorType' for any data that may occur as series. Scalable series are also easy to use by application designers. When creating the description: extract the descriptor as a series of values, scale if necessary, store the values and enter the appropriate scaling information. When rescaling the series, just make sure to respect the appropriate scaling rules. When using the description just read out the values. Some of these operations may be supported by generic routines.

Scalable series are efficient, flexible and powerful: Efficient, because storage overhead is minimal and all operations (creation, scaling, etc.) are fast; flexible, because any reasonable scaling operation is possible, at any time; and powerful, because a number of optional fields (such as 'Variance') are available to support advanced applications such as search.

6.7.12 SeriesOfScalarType

This descriptor represents a general series of scalars. Use this abstract type within descriptor definitions. Applications will instantiate the series as one of several subtypes defined below.

6.7.12.1 Description Syntax

<!-- Definition of abstract "SeriesOfScalarType" --><complexType name="SeriesOfScalarType" abstract="true"> <attribute name="nElements" type="nonNegativeInteger" use="default" value="1"/></complexType>

6.7.12.2 Description Semantics

Name                Definition
SeriesOfScalarType  A representation of a series of scalar values of a feature.
nElements           The number of elements in the series. The default is one.

6.7.12.3 Description Use
SeriesOfScalarType is a datatype for series of scalars. The number of samples in the series is stored in 'nElements'. This defaults to 1, so the descriptor can be used to store a single value with no overhead. Actual data are instantiated as subtypes of SeriesOfScalarType that allow the data to be stored scaled by a constant factor, scaled by a variable factor, or at full resolution. Scaling information is encapsulated, so the designer of an MPEG-7 descriptor need not worry about it. SeriesOfScalarType automatically supports a range of scaling operations and statistics.


6.7.12.4 Description Example
The following is an example of the use of SeriesOfScalarType in a descriptor definition, AudioPowerType, which describes the time-averaged squared audio waveform.

<!-- Power -->
<complexType name="AudioPowerType" base="AudioSampledType"
             derivedBy="extension">
  <element name="Value" type="mds:SeriesOfScalarType"
           maxOccurs="unbounded"/>
</complexType>

The descriptor is defined as a subtype of AudioSampledType that specifies the original sampling period of the full-resolution descriptor. The time series of descriptor samples is stored as a SeriesOfScalarType, which allows the temporal resolution to be further reduced. Multiple occurrences are allowed, so the data may be represented at different resolutions (possibly with different scaling rules, see below).

6.7.12.5 Description Use

6.7.12.5.1 Example 1
The application is a "music-lover's workbench", with software to produce, organize and edit audio data files. The volume of data (number and size of files) is enormous, but each file is labeled with a compact ".m7" file containing low-level signal descriptors: waveform envelope, spectrum envelope, fundamental frequency and periodicity measures, modulation spectrum, etc. These allow any file (or group of files) to be displayed as a waveform or a spectrogram, and possibly also auralized. They also allow classification, search and comparison between files. Crossing the descriptors with metadata such as "genre" allows disambiguation, etc. All of these low-level descriptors are based on ScalableSeries. The ScalableSeries support the necessary downsampling, and their statistics fields support the search and comparison functionalities.


6.7.13 SOSScaledType
This is an abstract type that holds those fields that are common to its two subtypes, SOSFixedScaleType and SOSFreeScaleType. Note that all fields are optional, instantiated at the discretion of the application software. Several fields may be present, representing different scaling operations applied to the same data with the same scale ratio. The subtypes specify the scale ratio.

6.7.13.1 Description Syntax

<!-- Definition of abstract "SOSScaledType" --><complexType name="SOSScaledType" base="mds:SeriesOfScalarType" derivedBy="extension" abstract="true"> <element name="Min" type="mds:FloatVector" minOccurs="0"/> <element name="Max" type="mds:FloatVector" minOccurs="0"/> <element name="Mean" type="mds:FloatVector" minOccurs="0"/> <element name="Random" type="mds:FloatVector" minOccurs="0"/> <element name="First" type="mds:FloatVector" minOccurs="0"/> <element name="Last" type="mds:FloatVector" minOccurs="0"/> <element name="Variance" type="mds:FloatVector" minOccurs="0"/> <element name="Weight" type="mds:FloatVector" minOccurs="0"/></complexType>

6.7.13.2 Description Semantics
Semantics of the SOSScaledType D:

Name           Definition
SOSScaledType  An abstract type for scaled series of scalars.
Mean           Series of means of groups of samples. Size of series must equal 'nElements'.
Min            Series of minima of groups of samples. Size of series must equal 'nElements'.
Max            Series of maxima of groups of samples. Size of series must equal 'nElements'.
Random         Downsampled series (one sample selected at random from each group of samples). Size of series must equal 'nElements'.
First          Downsampled series (first sample selected from each group of samples). Size of series must equal 'nElements'.
Last           Downsampled series (last sample selected from each group of samples). Size of series must equal 'nElements'.
Variance       Series of variances of groups of samples. Size of series must equal 'nElements'.
Weight         Optional series of weights. Contrary to the other fields, these do not represent values of the descriptor itself, but rather auxiliary weights that control scaling (see below). Size of series must equal 'nElements'.

6.7.13.3 Definitions of Scaling Operations
Scaling is restricted to operations such that the result of scaling the original series by a scale ratio of N = PQ is the same as first scaling the original series by P, and then scaling this scaled series by Q.

6.7.13.3.1 mean
Each scaled sample is calculated as the mean of the original samples x[kN+i], where N is the scale ratio. For a constant scale ratio N:

\[ \mu[k] = \frac{1}{N}\sum_{i=0}^{N-1} x[kN+i] \]

If the 'weight' field is present, scaled samples are calculated as a weighted mean. If all samples have zero weight, the mean is set to zero by convention (its value is indifferent as it has zero weight).
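An informative sketch of the 'mean' operation (Python; not part of the XM), together with a check of the composition property N = PQ stated above:

def scale_mean(x, n):
    """Scale a series by ratio n, each output sample the mean of n inputs.
    Assumes len(x) is divisible by n."""
    return [sum(x[k * n:(k + 1) * n]) / n for k in range(len(x) // n)]

x = [float(i) for i in range(8)]
print(scale_mean(x, 4) == scale_mean(scale_mean(x, 2), 2))   # True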

6.7.13.3.2 min, max
Each scaled sample is calculated as the min (or max) of the original samples:

\[ m[k] = \min_{0 \le i < N} x[kN+i], \qquad M[k] = \max_{0 \le i < N} x[kN+i] \]

If the 'weight' field is present, samples for which w[kN+i] = 0 are ignored in the min and max calculation. If all samples have zero weight, the min and max are set to zero by convention (their value is indifferent as it has zero weight).

6.7.13.3.3 random
Each scaled sample is obtained by selecting at random one sample from each group of samples. If the 'weight' field is present it is ignored.

6.7.13.3.4 first, last
Each scaled sample is obtained by selecting the first (or last) sample from each group of samples. If the 'weight' field is present it is ignored.

6.7.13.3.5 variance
Each sample of the 'variance' field is calculated as the (biased) variance of a group of samples of the original series. For a constant scale ratio N:

\[ v[k] = \frac{1}{N}\sum_{i=0}^{N-1}\big(x[kN+i] - \mu[k]\big)^2 \]

This formula differs from the standard formula for unbiased variance by the presence of N rather than N-1. This is purely for simplicity of calculation: the unbiased variance is easy to derive from it. If the 'weight' field is present, the terms of the sum are weighted. If all samples have zero weight, the variance is set to zero by convention (its value is indifferent as it has zero weight).

A 'mean' field should be instantiated each time a 'variance' field is instantiated, as it is required for further downsampling.
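An informative sketch of the biased group variance (Python; not part of the XM), showing how the unbiased estimate is derived from it as mentioned above:

def scale_variance(x, n):
    """Biased variance of each group of n samples (divisor n, not n - 1)."""
    out = []
    for k in range(len(x) // n):
        g = x[k * n:(k + 1) * n]
        m = sum(g) / n
        out.append(sum((v - m) ** 2 for v in g) / n)
    return out

biased = scale_variance([1.0, 3.0, 6.0, 10.0], 4)[0]
print(biased, biased * 4 / 3)   # biased 11.5; unbiased = biased * N / (N - 1)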

6.7.13.3.6 weight
The 'weight' field serves to control the scaling of the main data fields. The 'weight' field itself is scaled by taking the mean:

\[ w'[k] = \frac{1}{N}\sum_{i=0}^{N-1} w[kN+i] \]

6.7.14 SOSFixedScaleType
Use this type to instantiate a series of scalars with a uniform, power-of-two scale ratio. The scale ratio is the number of samples of the original series represented by each sample of the scaled series. It defines the resolution of the scaled series. The entire series uses the same scale ratio, and this ratio is constrained to be a power of two. This constraint simplifies the semantics. Use this type unless the application requires the more flexible (and complex) SOSFreeScaleType.

6.7.14.1 Description Syntax

<!-- Definition of "SOSFixedScaleType" (uniform power-of-2 scaling) --><complexType name="SOSFixedScaleType" base="mds:SOSScaledType"

derivedBy="extension"> <element name="VarianceScalewise" type="mds:FloatMatrix"

minOccurs="0"/> <attribute name="RootFirst" type="boolean" use="default"

value="false"/> <attribute name="ScaleRatio" type="positiveInteger" use="required"/> <attribute name="TotalSamples" type="positiveInteger"

use="required"/></complexType>

6.7.14.2 Description Semantics

Name               Definition
SOSFixedScaleType  A representation of a series of scalar values with reduced resolution.
ScaleRatio         Number of original samples represented by each sample of the scaled series. Must be a power of two.
VarianceScalewise  Optional array of arrays of scalewise variance coefficients. Scalewise variance is a decomposition of the variance into a series of coefficients, each of which describes the variability at a particular scale. There are log2(ScaleRatio) such coefficients. See the definition below. Number of rows must equal 'nElements', number of columns must equal the number of coefficients of the scalewise variance.
RootFirst          Optional flag. If true, the series are recoded in "root-first" format. This format is defined below. In brief: the recoded series starts with the grand mean (or min, or max, etc.) of the original series, and the subsequent values provide a progressively refined description from which the entire series can be reconstructed.
TotalSamples       Total number of samples in the original series before it was scaled.

Page 69: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

6.7.14.3 Definition of scalewise variance

Scalewise variance is defined only for 'SOSFixedScaleType' (uniform sampling with a power-of-two ratio). Scalewise variance is a decomposition of the variance into a vector of coefficients that describe variability at different scales. The sum of these coefficients equals the variance. To calculate the scalewise variance of a group of N = 2^m samples, first recursively form a binary tree of means:

\[ \mu^{(0)}[k] = x[k], \qquad \mu^{(j)}[k] = \tfrac{1}{2}\big(\mu^{(j-1)}[2k] + \mu^{(j-1)}[2k+1]\big), \quad j = 1,\dots,m \]

Then calculate the coefficients:

\[ c[j] = \frac{1}{2^{m-j}} \sum_{k=0}^{2^{m-j}-1} \big(\mu^{(j-1)}[2k] - \mu^{(j)}[k]\big)^2, \quad j = 1,\dots,m \]

The vector formed by these coefficients is the scalewise variance for this group of samples; their sum equals the (biased) variance of the group. The 'VarianceScalewise' field stores a series of such vectors.
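An informative sketch of this computation follows (Python; not part of the XM). It builds the binary tree of means level by level and checks that the coefficients sum to the biased variance:

def scalewise_variance(x):
    """Return the m coefficients c[1..m] for a group of 2**m samples."""
    coeffs, level = [], list(x)
    while len(level) > 1:
        parents = [0.5 * (level[2 * k] + level[2 * k + 1])
                   for k in range(len(level) // 2)]
        # average squared deviation of left children from their parent mean
        c = sum((level[2 * k] - parents[k]) ** 2
                for k in range(len(parents))) / len(parents)
        coeffs.append(c)
        level = parents
    return coeffs

x = [1.0, 3.0, 6.0, 10.0]
mean = sum(x) / len(x)
print(sum(scalewise_variance(x)), sum((v - mean) ** 2 for v in x) / len(x))  # equal: 11.5 11.5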

6.7.14.4 Definition of 'RootFirst' format

RootFirst format is defined only for 'SOSFixedScaleType' (uniform sampling with a power-of-two ratio). RootFirst format is a way of rearranging the coefficients so that they represent the original series in a "coarse-first, fine-last" fashion. Based on the previous binary tree of means, the coefficients y[k] of the root-first series are calculated as:

\[ y[0] = \mu^{(m)}[0], \qquad y\big[2^{m-j} + k\big] = \mu^{(j-1)}[2k] - \mu^{(j-1)}[2k+1], \quad j = m,\dots,1, \;\; k = 0,\dots,2^{m-j}-1 \]

The binary mean tree (and therefore the original series) can be reconstructed from this series:

\[ \mu^{(j-1)}[2k] = \mu^{(j)}[k] + \tfrac{1}{2}\,y\big[2^{m-j}+k\big], \qquad \mu^{(j-1)}[2k+1] = \mu^{(j)}[k] - \tfrac{1}{2}\,y\big[2^{m-j}+k\big] \]

The first coefficient is the grand mean. The second is the difference between the means of the first and second halves of the series, from which these two means can be calculated, etc. RootFirst format may be useful to transmit a description over a slow network, for example to display a progressively refined image of the descriptor.

RootFirst format is defined only for the 'mean' field. If 'RootFirst' is true, only the 'mean' field is allowed.

Editor's Note: A RootFirst format may also be defined for 'min' and 'max' and possibly 'variance' (TBD). If such a definition is provided this restriction to 'mean' may be relaxed.
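An informative sketch of the root-first recoding and its exact inverse follows (Python; not part of the XM). It is essentially a Haar-like transform of a series of 2^m values:

def to_root_first(x):
    """[grand mean, then half-difference coefficients, coarse to fine]."""
    levels, level = [], list(x)
    while len(level) > 1:
        levels.append([level[2*k] - level[2*k+1] for k in range(len(level)//2)])
        level = [0.5 * (level[2*k] + level[2*k+1]) for k in range(len(level)//2)]
    out = level                       # the grand mean
    for diffs in reversed(levels):    # coarse scales first
        out = out + diffs
    return out

def from_root_first(y):
    """Reconstruct the original series from root-first coefficients."""
    level, pos = [y[0]], 1
    while pos < len(y):
        diffs = y[pos:pos + len(level)]
        level = [m + s * 0.5 * d for m, d in zip(level, diffs) for s in (+1, -1)]
        pos += len(diffs)
    return level

x = [1.0, 3.0, 6.0, 10.0]
print(from_root_first(to_root_first(x)) == x)   # True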

6.7.15 SOSFreeScaleType

Use this type to instantiate a series of scalars with variable scaling and arbitrary integer values (not just powers of two) of the scale ratio. The scale ratio may vary within the series. SOSFreeScaleType may allow a better usage of descriptor storage space, by avoiding long series of identical samples for the parts of a series that are constant. The scaling ratio can be set according to any criterion (ScalableSeries are agnostic as to how the parameter was set).

6.7.15.1 Description Syntax

<!-- Definition of "SOSFreeScaleType" (variable ratio scaling) -->
<complexType name="SOSFreeScaleType" base="mds:SOSScaledType"
             derivedBy="extension">
  <sequence minOccurs="1" maxOccurs="unbounded">
    <element name="scaleRatio" type="positiveInteger"/>
    <element name="nSamples" type="positiveInteger"/>
  </sequence>
</complexType>

6.7.15.2 Description Semantics

Semantics of the SOSFreeScaleType D:

Name              Definition
SOSFreeScaleType  A representation of a series of scalar values with reduced resolution. Samples occur in runs. Each run has its own scale ratio. The series may contain any number of runs.
scaleRatio        Scale ratio common to all the samples in a run.
nSamples          Number of samples in a run.


6.7.16 SOSUnscaledType
Use this type to instantiate a series of data at full resolution. Instructions may be provided to specify how the data should be scaled when the need arises. It is also possible to use a scaled type with ScaleRatio=1 instead of SOSUnscaledType, but that option is wasteful if several scaled fields are included.

6.7.16.1 Description Syntax

<!-- Definition of "SOSUnscaledType" --><complexType name="SOSUnscaledType" base="mds:SeriesOfScalarType" derivedBy="extension"> <element name="Data" type="mds:FloatVector" minOccurs="1"/> <element name="Weight" type="mds:floatVector" minOccurs="0"/>Editor's note: minOccurs is not valid in attributeGroup <attributeGroup ref="SOSScalingInstructions" minOccurs="0"/></complexType>

<!-- Specify how this series should be scaled, should the need arise: -->
<attributeGroup name="SOSScalingInstructions">
  <attribute name="MeanFlag" type="boolean" use="default" value="false"/>
  <attribute name="MinFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxFlag" type="boolean" use="default" value="false"/>
  <attribute name="RandomFlag" type="boolean" use="default" value="false"/>
  <attribute name="FirstFlag" type="boolean" use="default" value="false"/>
  <attribute name="LastFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceScalewiseFlag" type="boolean" use="default" value="false"/>
</attributeGroup>

6.7.16.2 Description Semantics
Semantics of the SOSUnscaledType D:

Name                    Definition
SOSUnscaledType         A representation of a series of scalar values of a feature at full resolution (no scaling).
Data                    Values in the series. Size of array must equal 'nElements'.
Weight                  Optional array of weights. Weights serve to control scaling. Size of array must equal 'nElements'.
SOSScalingInstructions  Optional instructions for scaling. If present, these instructions specify how the data should be scaled, should the need for scaling arise. The scaled data are then stored in a scaled type (SOSFixedScaleType or SOSFreeScaleType).
MeanFlag                If true, scale to a series of means.
MinFlag                 If true, scale to a series of minima.
MaxFlag                 If true, scale to a series of maxima.
RandomFlag              If true, scale by choosing at random from each group of samples.
FirstFlag               If true, scale by choosing the first of each group of samples.
LastFlag                If true, scale by choosing the last of each group of samples.
VarianceFlag            If true, scale to a series of variances.
VarianceScalewiseFlag   If true, scale to a series of scalewise variances.

6.7.17 SeriesOfVectorType
This descriptor represents a general series of vectors. Use this abstract type within descriptor definitions. Applications will instantiate the series as one of several subtypes defined below.


6.7.17.1 Description Syntax

<!-- Definition of abstract "SeriesOfVectorType" -->
<complexType name="SeriesOfVectorType" abstract="true">
  <attribute name="nElements" type="nonNegativeInteger"
             use="default" value="1"/>
  <attribute name="VectorSize" type="positiveInteger"
             use="default" value="1"/>
</complexType>

6.7.17.2 Description Semantics
Semantics of the SeriesOfVectorType D:

Name                Definition
SeriesOfVectorType  A representation of a series of vector values of a feature.
nElements           Number of elements in the series. The default is one.
VectorSize          Size of the vector. Must equal the number of columns of the matrix types used to store the series.


6.7.17.3 Description Use

SeriesOfVectorType is a datatype for series of vectors. The number of samples (each a vector) in the series is stored in 'nElements'. This defaults to 1, so the descriptor can be used to store a single sample with no overhead. Actual data are instantiated as subtypes of SeriesOfVectorType that allow the data to be stored scaled by a constant factor, scaled by a variable factor, or at full resolution.

Scaling information is encapsulated, so the designer of an MPEG-7 descriptor need not worry about it. SeriesOfVectorType automatically supports a range of scaling operations and statistics that are useful for applications.

6.7.17.4 Example of use

This is an example of the use of SeriesOfVectorType in a descriptor definition. AudioSpectrumEnvelopeType describes the time-averaged log-band power spectrum. This is an abbreviated definition.

<!-- Spectrum -->
<complexType name="AudioSpectrumEnvelopeType" base="AudioSampledType"
             derivedBy="extension">
  <element name="Value" type="mds:SeriesOfVectorType"/>
  <!-- other useful information -->
</complexType>

The descriptor is defined as a subtype of AudioSampledType that specifies the original sampling period of the full-resolution descriptor. The time series of descriptor samples is stored as a SeriesOfVectorType that allows the temporal resolution to be further reduced. Multiple series are allowed, so the data may be represented at different resolutions (possibly with different scaling rules).

6.7.18 SOVScaledType

This is an abstract type that holds those fields that are common to its two subtypes. Note that all fields are optional. They are instantiated at the discretion of the application software.

6.7.18.1 Description Syntax

Editor's Note: This definition uses the Float3DMatrix type in the element for covariance. For the moment, it can be replaced with "FloatMatrix".


<!-- Definition of abstract "SOVScaledType" -->
<complexType name="SOVScaledType" base="mds:SeriesOfVectorType"
             derivedBy="extension" abstract="true">
  <element name="Min" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Max" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Mean" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Random" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="First" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Last" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Variance" type="mds:FloatMatrix" minOccurs="0"/>
  <element name="Covariance" type="mds:Float3DMatrix" minOccurs="0"/>
  <element name="VarianceSummed" type="mds:FloatVector" minOccurs="0"/>
  <element name="MaxSqDist" type="mds:FloatVector" minOccurs="0"/>
  <element name="Weight" type="mds:FloatVector" minOccurs="0"/>
</complexType>

6.7.18.2 Description Semantics

Semantics of the SOVScaledType D:

Name            Definition
SOVScaledType   An abstract type for scaled series of vectors.
Mean            Series of means of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Min             Series of minima of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Max             Series of maxima of groups of samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Random          Downsampled series (one sample selected at random from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
First           Downsampled series (first sample selected from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Last            Downsampled series (last sample selected from each group of samples). Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Variance        Series of variance vectors of groups of vector samples. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Covariance      Series of covariance matrices of groups of vector samples. Number of rows must equal 'nElements', number of columns and pages must equal 'VectorSize'.
VarianceSummed  Series of summed variance coefficients of groups of samples. Size of array must equal 'nElements'.
MaxSqDist       Series of coefficients representing an upper bound on the distance between groups of samples and their mean. Size of array must equal 'nElements'.
Weight          Optional series of weights. Weights control downsampling of the other fields (see the explanation for SeriesOfScalars). Size of array must equal 'nElements'.

6.7.18.3 Definition of Scaling Operations
Scaling operations are restricted to operations such that the result of scaling by a factor of N = PQ is the same as first scaling the original series by P, and then scaling this scaled series by Q. Most of these operations are straightforward extensions of the operations previously defined for series of scalars, applied uniformly to each dimension of the vectors. Operations that are specific to vectors are defined here.

6.7.18.3.1 variance
For series of vectors, several sorts of variance statistics are available. The 'SOVVariance' field stores a series of vectors of biased variance coefficients, calculated independently for each dimension of the original vector series. The 'SOVVarianceSummed' field stores the series of sums of such variance coefficients. The 'SOVCovariance' field stores a series of complete covariance matrices. These statistics offer a choice of several cost/performance tradeoffs. Definitions are straightforward extensions of the scalar case. For example, samples of 'SOVVarianceSummed' are calculated as:

\[ v[k] = \frac{1}{N}\sum_{i=0}^{N-1}\sum_{j=1}^{D}\big(x_j[kN+i] - \mu_j[k]\big)^2 \]

where D is the number of dimensions of the vector, N is the scale ratio, x_j[k] is the jth component of the kth vector sample, and \mu[k] is the mean of the kth group.

6.7.18.3.2 maximum squared distance from mean
Each sample of 'SOVMaxSqDist' stores an upper bound on the squared Euclidean distance between the samples of a group of the original series and their mean. To obtain such an upper bound for a group of N samples, calculate:

\[ d[k] = \max_{0 \le i < N} \sum_{j=1}^{D}\big(x_j[kN+i] - \mu_j[k]\big)^2 \]
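An informative sketch of both vector statistics follows (Python; not part of the XM, the function name is invented for the example). It computes the group mean, the summed variance, and the maximum squared distance for one group of vector samples:

def group_stats(vectors):
    d = len(vectors[0])
    n = len(vectors)
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    sq_dists = [sum((v[j] - mean[j]) ** 2 for j in range(d)) for v in vectors]
    variance_summed = sum(sq_dists) / n    # biased, summed over dimensions
    max_sq_dist = max(sq_dists)            # upper bound used by 'MaxSqDist'
    return mean, variance_summed, max_sq_dist

print(group_stats([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]))
# mean [1.0, 1.0], variance_summed 2.0, max_sq_dist 2.0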

6.7.19 SOVFixedScaleType
Use this type to instantiate a series of vectors with a uniform, power-of-two scale ratio. The scale ratio is the number of samples of the original series represented by each sample of the scaled series. It defines the resolution of the scaled series. The entire series uses the same scale ratio, and this ratio is constrained to be a power of two. This constraint simplifies the semantics. Use SOVFixedScaleType unless the application requires the more flexible (and complex) SOVFreeScaleType.

6.7.19.1 Description Syntax

<!-- Definition of "SOVFixedScaleType" (uniform power-of-2 scaling) -->
<complexType name="SOVFixedScaleType" base="mds:SOVScaledType"
             derivedBy="extension">
  <element name="VarianceScalewise" type="mds:FloatMatrix" minOccurs="0"/>
  <attribute name="RootFirst" type="boolean" use="default" value="false"/>
  <attribute name="ScaleRatio" type="positiveInteger" use="required"/>
  <attribute name="TotalSamples" type="positiveInteger" use="required"/>
</complexType>

6.7.19.2 Description Semantics

Name               Definition
SOVFixedScaleType  A representation of a reduced-resolution series of vector samples.
ScaleRatio         Number of original samples represented by each sample of the scaled series. Must be a power of two.
VarianceScalewise  Array of arrays of scalewise summed-variance coefficients. Scalewise variance is a decomposition of the variance into a series of coefficients, each of which describes the variability at a particular scale. Number of rows must equal 'nElements', number of columns must equal the number of coefficients of the scalewise variance.
RootFirst          If true, the series are recoded in "root-first" format. See the explanation for SOSFixedScaleType.
TotalSamples       Total number of samples in the original series before it was scaled. Each sample is a vector.

6.7.20 SOVFreeScaleType
Use this type to instantiate a series of vectors with variable scaling and arbitrary integer values (not just powers of two) of the scale ratio. SOVFreeScaleType may allow a better usage of descriptor storage space, by avoiding long series of identical samples for the parts of a series that are constant. The scaling ratio can be set according to any criterion (ScalableSeries are agnostic as to how the parameter was set).

6.7.20.1 Description Syntax

<!-- Definition of "SOVFreeScaleType" (variable arbitrary-ratio scaling) -->
<complexType name="SOVFreeScaleType" base="mds:SOVScaledType"
             derivedBy="extension">
  <sequence minOccurs="1" maxOccurs="unbounded">
    <element name="scaleRatio" type="positiveInteger"/>
    <element name="nSamples" type="positiveInteger"/>
  </sequence>
</complexType>

6.7.20.2 Description Semantics

Name              Definition
SOVFreeScaleType  A representation of a reduced-resolution series of vector samples. Samples occur in runs. Each run has its own scale ratio. The series may contain any number of runs.
scaleRatio        Scale ratio common to all the samples in a run.
nSamples          Number of samples in a run.

6.7.21 SOVUnscaledType
Use this type to instantiate a series of vector data that are not (yet) scaled. Instructions may be provided to specify how the data should be scaled when the need arises.

6.7.21.1 Description Syntax

<!-- Definition of "SOVUnscaledType" --><complexType name="SOVUnscaledType" base="mds:SeriesOfVectorType" derivedBy="extension"> <element name="Data" type="mds:FloatMatrix" minOccurs="1"/> <element name="Weight" type="mds:FloatVector" minOccurs="0"/>Editor's Note: minOccurs is not allowed in attributeGroup <attributeGroup ref="mds:SOVScalingInstructions" minOccurs="0"/></complexType>

<!-- Specify how this series should be scaled, should the need arise: -->
<attributeGroup name="SOVScalingInstructions">
  <attribute name="MeanFlag" type="boolean" use="default" value="false"/>
  <attribute name="MinFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxFlag" type="boolean" use="default" value="false"/>
  <attribute name="UniformFlag" type="boolean" use="default" value="false"/>
  <attribute name="RandomFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="CovarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceSummedFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceScalewiseFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxSqDistFlag" type="boolean" use="default" value="false"/>
</attributeGroup>

6.7.21.2 Description Semantics
Semantics of the SOVUnscaledType D:

Name                    Definition
SOVUnscaledType         A representation of a series of vector values of a feature at full resolution (no scaling).
Data                    Values. Number of rows must equal 'nElements', number of columns must equal 'VectorSize'.
Weight                  Optional array of weights. Weights serve to control scaling. Size of array must equal 'nElements'.
SOVScalingInstructions  Optional instructions for scaling. If present, these instructions specify how the data should be scaled, should the need for scaling arise. The scaled data are then stored in a scaled type (SOVFixedScaleType or SOVFreeScaleType).
MeanFlag                If true, scale to a series of means.
MinFlag                 If true, scale to a series of minima.
MaxFlag                 If true, scale to a series of maxima.
RandomFlag              If true, scale by choosing at random from each group of samples.
FirstFlag               If true, scale by choosing the first of each group of samples.
LastFlag                If true, scale by choosing the last of each group of samples.
VarianceFlag            If true, scale to a series of variance vectors.
CovarianceFlag          If true, scale to a series of covariance matrices.
VarianceSummedFlag      If true, scale to a series of summed variance coefficients.
VarianceScalewiseFlag   If true, scale to a series of vectors of scalewise summed-variance coefficients.
MaxSqDistFlag           If true, scale to a series of maximum squared distance coefficients.

6.7.20 Additional informative text

6.7.20.1 Scalable series FAQ

6.7.20.1.1 I am an end user, how do I benefit from scalable series?

You probably won't ever meet one. Scalable series help the designer of the software that you use do a good job handling large amounts of data. Scalable series can handle data of any size, so they will help make today's tools still work in the year 2020, when files, media collections, and networks are all about a million times larger than they are today (supposing a doubling in size every year).

6.7.20.1.2 What is a scalable series?

A Scalable Series is a data structure for representing series of values, for example time series. Such series occur in descriptors for audio or other data. The ScalableSeries allows these series to be scaled in a well-controlled fashion.

6.7.20.1.3 I am a descriptor designer, when might I use a scalable series?

Whenever you need to represent a descriptor that can occur as a series, in particular a time series. Currently only series of scalars and vectors are supported. Series of matrices could be defined if necessary.

6.7.20.1.4 I am a descriptor designer, why should I use a scalable series?

Because it allows your descriptor to represent, indifferently: (a) a series of descriptor values at full resolution, (b) the same series with reduced resolution, (c) a single value. In the latter case there is no overhead. If you don't use a scalable series, you must decide the appropriate resolution yourself, and you may severely limit the options of the application developer that uses your descriptor.


6.7.20.1.5 I am a descriptor designer, how do I use a scalable series?

You just define the data as SeriesOfScalar or SeriesOfVector, whichever is appropriate. These data structures support scaling when the need arises. It is worth giving some thought to the way your descriptor reacts to scaling operations (mean, min, max, etc.). To take an example, when designing an audio spectrum descriptor, you might choose a power spectrum rather than a log magnitude spectrum to take advantage of the well-behaved averaging properties of the power spectrum.

6.7.20.1.6 I am an application designer, how do I benefit from scalable series?

Your data can be stored at any resolution. You choose the resolution you need, rather than having it fixed by the assumptions of a descriptor designer. Your software may change the resolution dynamically, to suit the user's desires and/or changes in storage or bandwidth constraints. It is possible to store the same descriptor at several resolutions, and/or using different scaling semantics (for example both "Mean" and "Variance", "Min" and "Max", etc.).

6.7.20.1.7 I have no patience for this scaling business, may I just ignore it?

Yes. A series may be instantiated at full resolution (using SOSUnscaledType or SOVUnscaledType) with no overhead. You can also use a scalable series to store a single value with no overhead.

6.7.20.1.8 I'm labeling time-series data with a scalar descriptor, how do I store it at full resolution?

You're extracting the descriptor at certain time intervals. The descriptor should be a subtype of 'SampledType' and therefore have a field named 'HopSize'. You put the time interval in that field. You instantiate the SeriesOfScalar using the subtype 'SOSUnscaledType'. You set the 'nElements' field to the number of samples. If you expect the series to need scaling at some later time, set the appropriate flags to say how scaling should be performed.

6.7.20.1.9 I'm labeling data with a scalar descriptor, how do I store it at reduced resolution?

You put the extraction period (frame period) in the 'HopSize' field, as before. You instantiate the SeriesOfScalar using a subtype of either SOSFixedScaleType or SOSFreeScaleType, according to whether you intend to use different scale ratios for different parts of the series (if in doubt, choose SOSFixedScaleType as it is simpler). You set 'TotalSamples' to the number of samples before scaling, and 'nElements' to the number of samples after scaling. You perform the scaling according to one (or more) of the allowed methods and store the data in the appropriate field, as sketched below.
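The following informative Python sketch (not part of the standard) illustrates the bookkeeping this answer describes: reducing a scalar series by a fixed power-of-two scale ratio and filling Mean/Min/Max style fields. All function and field names are illustrative.

def scale_series(samples, ratio):
    """Summarize each group of `ratio` samples; the last group may be partial."""
    groups = [samples[i:i + ratio] for i in range(0, len(samples), ratio)]
    return {
        "TotalSamples": len(samples),
        "ScaleRatio": ratio,
        "Mean": [sum(g) / len(g) for g in groups],
        "Min": [min(g) for g in groups],
        "Max": [max(g) for g in groups],
    }

print(scale_series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], ratio=4))
# -> {'TotalSamples': 8, 'ScaleRatio': 4, 'Mean': [2.5, 6.5],
#     'Min': [1.0, 5.0], 'Max': [4.0, 8.0]}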

6.7.20.1.10 I'm reading an MPEG7 file, how do I make sense of this descriptor?

The type of the descriptor tells you what it is, and the documentation specifies how it was extracted. The descriptor's 'HopSize' field tells you at what intervals it was extracted. The names of the fields present ('Mean', etc.) tell you whether it was scaled, and how. 'ScaleRatio' and 'TotalSamples' fully specify the scaling geometry. Fields such as 'Variance', if present, may be useful for search.

6.7.20.1.11 My MPEG7 files take too much space, I can no longer afford the storage. Can I downsize the descriptions?

You can downsize any descriptor that uses a scalable series. You apply the scaling rules corresponding to the field type (mean, max, etc.) or specified by the flags (in SOSUnscaledType and SOVUnscaledType). You then update 'nElements' and 'ScaleRatio'.

6.7.20.1.12 What's the story behind "SOSFixedScaleType" and "SOSFreeScaleType"?

"SOSFixedScaleType" is used when the resolution (scale ratio) is the same throughout the series. This scale ratio is further constrained to be a power of two (this simplifies things). "SOSFreeScaleType" allows more flexibility in two ways: the scale ratio may vary during the series (i.e. the first few elements may each summarize N1 values, the next few N2, etc.), and its values are not constrained to be powers of two. The application programmer may choose either, but "SOSFixedScaleType" is recommended by default as it is simpler.

6.7.20.1.13 What is "RootFirst"?

It is possible to transform the series so that it starts with the grand mean (or grand maximum, or grand minimum, etc.), followed by coefficients that allow the means over successively smaller intervals to be calculated, and ultimately the values of the series. The cost is the same as if the series itself were transmitted. This allows the description to be progressively refined, for example when it is transferred over a network. "RootFirst", when true, indicates that the series is in this particular format, as illustrated below.
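The following informative Python sketch (not part of the standard) only illustrates the progressive-refinement ordering behind "root-first": a pyramid of means whose coarsest level (the grand mean) comes first. The actual root-first coefficient coding is not reproduced here.

def mean_pyramid(samples):
    """Build levels of pair-wise means, returned coarsest (root) first.

    The series length is assumed to be a power of two for simplicity.
    """
    levels = [list(samples)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels[::-1]  # grand mean first, full-resolution series last

for level in mean_pyramid([1.0, 3.0, 5.0, 7.0]):
    print(level)  # [4.0], then [2.0, 6.0], then [1.0, 3.0, 5.0, 7.0]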


7 Description of the media

7.1.1 MediaIdentification DS

The Media Identification DS contains description tools (Ds and DSs) that are specific to the identification of the master media (the instances are described by the Media Instance DS).

7.1.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.1.3 Description Extraction

Manual instantiation.

7.1.1.4 Description Example

<MediaIdentification>
  <Identifier IdOrganization="ISO" IdName="ISBN">0-7803-5610-1</Identifier>
</MediaIdentification>

<!-- a simple approach for identifying MPEG-7 content set items -->
<!-- mpeg7_17/news1 -->
<MediaIdentification>
  <Identifier IdOrganization="MPEG" IdName="MPEG7ContentSet">
    mpeg7_content:news1
  </Identifier>
  <VisualDomain>natural</VisualDomain>
</MediaIdentification>

7.1.1.5 Description Use

The MediaIdentification DS is used to uniquely identify the AV content to which the remaining descriptions are related. It identifies the master media, not the different possible instances (described via the Media Instance description scheme, see section 7.1.4). The different domain Ds allow the description of the nature (source, acquisition, use) of the content under description.

7.1.2 MediaFormat DS

The Media Format DS contains description tools (Ds and DSs) that are specific to the storage format of the media.

7.1.2.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.2.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.2.3 Description Extraction

Manual instantiation.

7.1.2.4 Description Example


<!-- mpeg7_17/news1 -->
<MediaFormat>
  <FileFormat>MPEG-1</FileFormat>
  <System>PAL</System>
  <Medium>CD</Medium>
  <Color>color</Color>
  <Sound>mono</Sound>
  <FileSize>666.478.608</FileSize>
  <Length><m>00:38</m><s>18</s></Length>
  <AudioChannels>1</AudioChannels>
  <AudioLanguage>
    <LanguageCode>es</LanguageCode>
    <CountryCode>es</CountryCode>
  </AudioLanguage>
  <AudioCoding>AC-3</AudioCoding>
</MediaFormat>

7.1.2.5 Description Use

The Media Format DS is used for describing the different possible physical formats of the AV content.

7.1.3 MediaCoding DS

The Media Coding DS contains description tools (Ds and DSs) that are specific to the coding parameters of the media.

7.1.3.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.3.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.3.3 Description Extraction

Manual instantiation.

7.1.3.4 Description Example

<!-- mpeg7_17/news1 -->
<MediaCoding>
  <FrameWidth>352</FrameWidth>
  <FrameHeight>288</FrameHeight>
  <FrameRate>25</FrameRate>
  <CompressionFormat>MPEG-1</CompressionFormat>
</MediaCoding>

7.1.3.5 Description Use

The Media Coding DS is used for describing the different possible coding and compression parameters of the instances of the coded AV content.

Editorial note: The AspectRatio D is redundant with the FrameWidth and FrameHeight ones. Nevertheless, it has been included because it is also present in the (current version of the) SMPTE Metadata Dictionary.

7.1.4 MediaInstance DS

The Media Instance DS contains the description tools (Ds and DSs) that identify and locate the material instances.

7.1.4.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


7.1.4.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.4.3 Description Extraction

Manual instantiation.

7.1.4.4 Description Example

<!-- mpeg7_17/news1 -->
<MediaInstance>
  <Identifier IdOrganization="MPEG" IdName="MPEG7ContentSetCD">
    mpeg7_17/news1
  </Identifier>
  <InstanceLocator>
    <MediaURL>file://D:/Mpeg7_17/news1.mpg</MediaURL>
  </InstanceLocator>
</MediaInstance>

7.1.4.5 Description Use

The Media Instance DS is used for identifying instances of the AV material described (uniquely identified via the Media Identification DS identifier) and for locating those instances. The instance location can be either on-line (a URL allowing direct access to the media instance) or off-line (for an off-line request of delivery, or even of digitization and on-line publication).

7.1.5 MediaTranscodingHints DS

The MediaTranscodingHints DS contains description tools that allow the description of transcoding hints of the media being described.

The MediaTranscodingHints DS is proposed as an extension of the MediaProfile DS; its future promotion will imply a slight modification of the MediaProfile DS, as shown below.

<!-- ################################################ -->
<!-- Definition of the Media Profile DS               -->
<!-- ################################################ -->
<complexType name="MediaProfile">
  <element name="MediaIdentification" type="mds:MediaIdentification"/>
  <element name="MediaFormat" type="mds:MediaFormat"/>
  <element name="MediaCoding" type="mds:MediaCoding"
           minOccurs="0" maxOccurs="unbounded"/>
  <element name="MediaInstance" type="mds:MediaInstance"
           minOccurs="0" maxOccurs="unbounded"/>
  <element name="MediaTranscodingHints" type="mds:MediaTranscodingHints"
           minOccurs="0" maxOccurs="1"/>
  <attribute name="id" type="ID"/>
</complexType>

7.1.5.1 Description Scheme Syntax

<!-- ################################################ -->
<!-- Definition of the Media Transcoding Hint DS      -->
<!-- ################################################ -->
<complexType name="MediaTranscodingHints">
  <element name="MotionHint" type="mds:MotionHint"
           minOccurs="0" maxOccurs="1"/>
  <element name="DifficultyHint" type="mds:DifficultyHint"
           minOccurs="0" maxOccurs="1"/>
  <attribute name="DifficultyHint" type="float" use="optional"/>
  <attribute name="Importance" type="float" use="optional"/>
  <attribute name="id" type="ID" use="optional"/>
</complexType>

Editor's note: the DifficultyHint type is not defined

7.1.5.2 Description Scheme Semantics

Name Definition
MediaTranscodingHints DS describing transcoding hints of the media being described.
id Identification of the instance of the MediaTranscodingHints description.
DifficultyHint Attribute that specifies the transcoding difficulty of the media. The DifficultyHint takes values from 0.0 to 1.0, where 0.0 indicates the lowest difficulty and 1.0 indicates the highest difficulty.
Importance Attribute that specifies the importance of the media. The Importance takes values from 0.0 to 1.0, where 0.0 indicates the lowest importance and 1.0 indicates the highest importance.
MotionHint Description of the motion hint for the transcoder.

7.1.5.3 Description Extraction

The extraction method of the Motion Hint DS is described in section 7.1.6. The extraction method of the Difficulty Hint DS is described in section 7.1.7.

The Difficulty Hint aims at describing the difficulty of compressing the material within the AV segment, for efficient bit allocation by the transcoder. The transcoder can assign the appropriate bitrate based on both the importance and the difficulty of the contents. The Difficulty Hint DS is especially useful for transcoding from CBR (Constant Bit Rate) coding to VBR (Variable Bit Rate) coding.

The absolute value of the difficulty is not important for the transcoder; the relative value within the contents is what allows the bitrate to be allocated efficiently. The difficulty is therefore normalized within the contents. The maximum value of the Difficulty Hint is 1.0, which indicates the most difficult part to encode within the contents. In the simplest case, the Difficulty Hint can be measured in the following steps:

1. Encode the material at Q=1 (uncompressed).
2. Calculate the generated bits/frame for each segment.
3. Normalize the values calculated in step 2.

In the simplest case, the bits allocated to a segment for CBR-VBR conversion can be calculated by the following formula:

Target bitrate = (Total_bits * Difficulty * LengthofSegment / (Sum of (Difficulty * LengthofSegment))) * FrameRate / LengthofSegment

where LengthofSegment is the number of frames in the segment.
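As an informative check of the formula, the following Python sketch (not part of the standard) allocates a target bitrate to each segment from (difficulty, length) pairs; all names are illustrative.

def target_bitrates(segments, total_bits, frame_rate):
    """segments: list of (difficulty, length_in_frames) pairs."""
    weight_sum = sum(d * n for d, n in segments)  # Sum of (Difficulty * Length)
    return [total_bits * (d * n) / weight_sum * frame_rate / n
            for d, n in segments]

# Two 300-frame segments, the second twice as difficult as the first; the
# allocated bitrates keep the same 1:2 ratio and still sum to total_bits:
print(target_bitrates([(0.2, 300), (0.4, 300)],
                      total_bits=9_000_000, frame_rate=25))
# -> [250000.0, 500000.0] bits per second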

7.1.5.4 Description Example

<MediaTranscodingHint id="segment1" DifficultyHint="0.562789"
                      Importance="0.946723">
  <MotionHint MotionRangeXLeft="-18" MotionRangeXRight="22"
              MotionRangeYBottom="-14" MotionRangeYTop="12"/>
  <DifficultyHint type="TranscodingDifficulty">0.017941</DifficultyHint>
</MediaTranscodingHint>

7.1.5.5 Description Use

The Media Transcoding Hint DS is used for reducing the complexity and improving the picture quality of the transcoder in applications that deal with the delivery of image, video, audio and multimedia content under different network conditions, user and publisher preferences, and capabilities of terminal devices. The Difficulty Hint is used for efficient bit allocation when transcoding from CBR to VBR. In general, the coding efficiency of VBR is better than that of CBR: the encoder can assign more bits to a difficult scene and remove bits from a relatively easy scene. Because the picture quality of the difficult scene is improved while the picture quality of the easy scene is kept, the overall subjective quality can be improved considerably. Examples of video transcoder architectures are given in [6].


7.1.6 MotionHint DS

The Motion Hint DS aims at describing the motion within an AV segment, in order to improve the picture quality and to reduce the computational complexity requirements of hardware and software implementations of the transcoder. This Description Scheme is mainly useful for transcoding from an uncompressed or non-motion-compensated source format (non-MPEG, or I-frame-only MPEG) to a motion-compensated target format (e.g. IPB MPEG). The transcoding hints can be used in video transcoding architectures such as those described in [6] to reduce the computational complexity.

7.1.6.1 Description Scheme Syntax

<!-- ################################################ -->
<!-- Definition of the Motion Hint DS                 -->
<!-- ################################################ -->
<complexType name="MotionHint" content="empty">
  <!-- <attribute name="Motion_uncompensability" type="float"/> -->
  <attribute name="MotionRangeXLeft" type="integer"/>
  <attribute name="MotionRangeXRight" type="integer"/>
  <attribute name="MotionRangeYBottom" type="integer"/>
  <attribute name="MotionRangeYTop" type="integer"/>
</complexType>

[Note 1: Motion_uncompensability was not verified at the Geneva meeting. It will be added when it is verified at the Beijing meeting.]

7.1.6.2 Description Scheme Semantics

Name Definition
MotionHint DS describing motion hints of the media being described.
MotionRangeXLeft Integer that specifies the recommended search range of the horizontal motion vectors to the left.
MotionRangeXRight Integer that specifies the recommended search range of the horizontal motion vectors to the right.
MotionRangeYBottom Integer that specifies the recommended search range of the vertical motion vectors to the bottom.
MotionRangeYTop Integer that specifies the recommended search range of the vertical motion vectors to the top.

7.1.6.3 Description Extraction

This DS is independent of the source content format. Feature points are extracted (for example by a Lucas/Kanade feature point tracker) for every single frame of a video sequence. These feature points are tracked from frame to frame, if possible. A constant number of feature points is used per frame. The feature point tracking yields the motion vector of each feature point. The motion vector components of maximum size for this AV segment in the negative and positive x-direction are saved in MotionRangeXLeft and MotionRangeXRight. The motion vector components of maximum size in the negative and positive y-direction are saved in MotionRangeYBottom and MotionRangeYTop for this AV segment.
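The following informative Python sketch (not part of the standard) shows how the MotionRange* hints could be derived once the feature-point motion vectors are available; the tracker itself is assumed to exist elsewhere, and all names are illustrative.

def motion_range_hints(motion_vectors):
    """motion_vectors: list of (dx, dy) displacements from feature tracking."""
    xs = [dx for dx, _ in motion_vectors]
    ys = [dy for _, dy in motion_vectors]
    return {
        "MotionRangeXLeft": min(xs),   # most negative horizontal component
        "MotionRangeXRight": max(xs),  # most positive horizontal component
        "MotionRangeYBottom": min(ys), # most negative vertical component
        "MotionRangeYTop": max(ys),    # most positive vertical component
    }

print(motion_range_hints([(-18, -3), (22, 12), (5, -14)]))
# -> {'MotionRangeXLeft': -18, 'MotionRangeXRight': 22,
#     'MotionRangeYBottom': -14, 'MotionRangeYTop': 12}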

7.1.6.4 Description Example

<MediaTranscodingHint id="segment1">
  <MotionHint MotionRangeXLeft="-15" MotionRangeXRight="15"
              MotionRangeYBottom="-15" MotionRangeYTop="15"/>
</MediaTranscodingHint>

<MediaTranscodingHint id="segment2">
  <MotionHint MotionRangeXLeft="-18" MotionRangeXRight="22"
              MotionRangeYBottom="-14" MotionRangeYTop="12"/>
</MediaTranscodingHint>

7.1.6.5 Description Use

The Motion Hint DS helps the transcoder to select the appropriate motion estimation search range and also motion estimation algorithmic parameters. As an example of algorithmic parameter selection, when MotionRangeXLeft and MotionRangeXRight are high, a hierarchical or coarse-search motion estimation algorithm with a high search range can be employed in the x-direction to reduce the computational complexity. In the other case, when MotionRangeXLeft and MotionRangeXRight are low, the computational complexity can be reduced by suitably adjusting the search range of the motion estimation algorithm (and, e.g., the fcode in the case of MPEG), while preserving the visual quality.

7.1.7 DifficultyHint DS

The Difficulty Hint DS aims at describing the difficulty of compressing the material within the AV segment, for efficient bit allocation by the transcoder. The transcoder can assign the appropriate bitrate based on both the importance and the difficulty of the contents. This Difficulty Hint DS is especially useful for transcoding from CBR (Constant Bit Rate) coding to VBR (Variable Bit Rate) coding.

7.1.7.1 Description Scheme Syntax

<!-- ################################################ -->
<!-- Definition of the Difficulty Hint DS             -->
<!-- ################################################ -->

Editor's Note: attribute cannot be a child of element.

<element name="DifficultyHint" type="mds:Difficulty">
  <attribute name="Type" type="DifficultyType"/>
</element>

<simpleType name="Difficulty" base="float">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</simpleType>

<simpleType name="Type" base="string">
  <enumeration value="TranscodingDifficulty"/>
</simpleType>

7.1.7.2 Description Scheme Semantics

Name Definition
Difficulty Float that specifies the transcoding difficulty. This value must be defined between 0.0 and 1.0, where 1.0 is used for the most difficult AV segment to encode within the content.
Type String that specifies the type of difficulty.

7.1.7.3 Description Extraction

The absolute value of the difficulty is not important for the transcoder; the relative value within the contents is what allows the bitrate to be allocated efficiently. The difficulty is therefore normalized within the contents. The maximum value of the Difficulty Hint is 1.0, which indicates the most difficult part to encode within the contents. In the simplest case, the Difficulty Hint can be measured in the following steps:

1. Encode the material at Q=1 (uncompressed).
2. Calculate the generated bits/frame for each segment.
3. Normalize the values calculated in step 2.

In the simplest case, the bits allocated to a segment for CBR-VBR conversion can be calculated by the following formula:

Target bitrate = (Total_bits * Difficulty * LengthofSegment / (Sum of (Difficulty * LengthofSegment))) * FrameRate / LengthofSegment


where LengthofSegment is the number of frames in the segment.

7.1.7.4 Description Example

<MediaTranscodingHint id="segment1">
  <DifficultyHint type="TranscodingDifficulty">0.017941</DifficultyHint>
</MediaTranscodingHint>

<MediaTranscodingHint id="segment2">
  <DifficultyHint type="TranscodingDifficulty">0.035726</DifficultyHint>
</MediaTranscodingHint>

7.1.7.5 Description Use

The Difficulty Hint DS is used for efficient bit allocation when transcoding from CBR to VBR. In general, the coding efficiency of VBR is better than that of CBR: the encoder can assign more bits to a difficult scene and remove bits from a relatively easy scene. Because the picture quality of the difficult scene is improved while the picture quality of the easy scene is kept, the overall subjective quality can be improved considerably.

7.1.8 MediaProfile DS

The Media Profile DS contains the different description tools that allow the description of one profile of the content being described.

7.1.8.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.8.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.8.3 Description Extraction

Manual instantiation.

7.1.8.4 Description Example

See the example in section 7.1.9.

7.1.8.5 Description Use

The Media Profile DS contains all the description tools that allow the definition of a profile for AV content instances; that is, it allows the definition of the different formats, resolutions, compression formats, etc., that can be available for an AV content master, and the identification and location of the different instances of such a master media profile.

The multiple cardinality of the MediaCoding DS allows the profile to define the different codings of each of the monomedia components (e.g., the audio and the video) that are stored in a Media Format. For example, there can be several MPEG-2 file profiles with different pairs of coding values for the audio and the video.

7.1.9 MediaInformation DS

The Media Information DS contains description tools (Ds and DSs) that are specific to the storage media.

7.1.9.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

7.1.9.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


7.1.9.3 Description Extraction

Manual instantiation.

7.1.9.4 Description Example

<MediaInformation>
  <MediaIdentification>
    <Identifier IdOrganization="MPEG" IdName="MPEG7ContentSet">
      mpeg7_content:news1
    </Identifier>
  </MediaIdentification>
  <MediaProfile>
    <MediaFormat>
      <FileFormat>MPEG-1</FileFormat>
      <System>PAL</System>
      <Medium>CD</Medium>
      <Color>color</Color>
      <Sound>mono</Sound>
      <FileSize>666.478.608</FileSize>
      <Length><m>00:38</m><s>18</s></Length>
      <AudioChannels>1</AudioChannels>
    </MediaFormat>
    <MediaCoding>
      <FrameWidth>352</FrameWidth>
      <FrameHeight>288</FrameHeight>
      <FrameRate>25</FrameRate>
      <CompressionFormat>MPEG-1</CompressionFormat>
    </MediaCoding>
    <MediaInstance>
      <Identifier IdOrganization="MPEG" IdName="MPEG7ContentSetCD">
        mpeg7_17/news1
      </Identifier>
      <Locator>
        <MediaURL>file://D:/Mpeg7_17/news1.mpg</MediaURL>
      </Locator>
    </MediaInstance>
  </MediaProfile>
</MediaInformation>

7.1.9.5 Description Use

The Media Information DS is used for containing the different profiles available for an AV content item.


8 Description of the content creation & production

8.1.1 Creation DS

The Creation DS contains the description tools (Ds and DSs) related to the creation of the content, including places, dates, actions, materials, staff (technical and artistic) and organizations involved.

8.1.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.1.3 Description Extraction

Manual instantiation.

8.1.1.4 Description Example

<Creation>
  <Title type="original">
    <TitleText xml:lang="es">Telediario (segunda edición)</TitleText>
    <TitleImage>
      <MediaURL>file://images/teledario_ori.jpg</MediaURL>
    </TitleImage>
  </Title>
  <Title type="alternative">
    <TitleText xml:lang="es">Noticias de la tarde</TitleText>
    <TitleImage>
      <MediaURL>file://images/teledario_alt.jpg</MediaURL>
    </TitleImage>
  </Title>
  <Title type="alternative">
    <TitleText xml:lang="en">Afternoon news</TitleText>
    <TitleImage>
      <MediaURL>file://images/teledario_en.jpg</MediaURL>
    </TitleImage>
  </Title>
  <Creator>
    <role>presenter</role>
    <Individual>
      <GivenName>Ana</GivenName>
      <FamilyName>Blanco</FamilyName>
    </Individual>
  </Creator>
  <CreationDate>1998-06-16</CreationDate>
  <CreationLocation>
    <PlaceName xml:lang="es">Piruli</PlaceName>
    <Country>es</Country>
    <AdministrativeUnit>Madrid</AdministrativeUnit>
  </CreationLocation>
</Creation>

Editor's Note: The time description should be updated


8.1.1.5 Description Use

The Creation DS is used for describing all the information related to the creation of the AV content.

8.1.2 Classification DS

The Classification DS contains the description tools (Ds and DSs) that allow classifying the material.

8.1.2.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.2.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.2.3 Description Extraction

Manual instantiation.

8.1.2.4 Description Example

<Classification>
  <CountryCode>es</CountryCode>
  <Language>
    <LanguageCode>es</LanguageCode>
    <CountryCode>es</CountryCode>
  </Language>
  <Genre>News</Genre>
  <PackagedType>Information</PackagedType>
  <Purpose>broadcasting</Purpose>
  <ParentalGuidance>PG13</ParentalGuidance>
</Classification>

8.1.2.5 Description Use

The Classification DS is used for the description of the classification of the AV content. It allows searching and filtering based on user preferences regarding user-oriented classifications (e.g., language, style, genre, etc.) and service-oriented classifications (e.g., purpose, parental guidance, market segmentation, etc.).

8.1.3 MediaReview DS

The MediaReview DS is proposed as an extension of the Classification DS; its future promotion to the WD will imply a slight modification of the Classification DS, as shown below.

<!-- ################################################ -->
<!-- Definition of the Classification DS              -->
<!-- ################################################ -->
<complexType name="Classification">
  <element ref="Country" minOccurs="0" maxOccurs="unbounded"/>
  <element name="Language" type="language"
           minOccurs="0" maxOccurs="unbounded"/>
  <element name="Genre" type="mds:ControlledTerm"
           minOccurs="0" maxOccurs="unbounded"/>
  <element name="PackagedType" type="mds:ControlledTerm"
           minOccurs="0" maxOccurs="unbounded"/>
  <element name="MediaReview" type="mds:MediaReview"
           minOccurs="0" maxOccurs="unbounded"/>
  <attribute name="id" type="ID"/>
</complexType>


8.1.3.1 Description Scheme Syntax

Editor's Note: Why are most of the entities defined as (global) elements? The specification is inconsistent with the UserPreference.

<!-- ############################################################# -->
<!-- Definition of the Media Review DS as an XML Schema            -->
<!-- ############################################################# -->

<element name="FreeTextReview">
  <complexType base="string" derivedBy="extension">
    <attribute name="mpeg7:lang" type="language"/>
  </complexType>
</element>

<element name="Reviewer">
  <complexType base="mds:Person" derivedBy="extension">
    <element name="role" type="mds:ControlledTerm" minOccurs="0"/>
  </complexType>
</element>

<element name="RatingCriterion">
  <complexType>
    <element name="CriterionName" type="string"/>
    <element name="WorstRating" type="integer"/>
    <element name="BestRating" type="integer"/>
  </complexType>
</element>

<element name="Rating">
  <complexType>
    <element ref="RatingCriterion"/>
    <element name="RatingValue" type="integer"/>
  </complexType>
</element>

<element name="MediaReview">
  <complexType>
    <element ref="Reviewer" minOccurs="1" maxOccurs="1"/>
    <element ref="Rating" minOccurs="0" maxOccurs="unbounded"/>
    <element ref="FreeTextReview" minOccurs="0" maxOccurs="unbounded"/>
    <attribute name="id" type="ID"/>
  </complexType>
</element>

<complexType name="Reviewer" base="mds:Person" derivedBy="extension">
  <element name="role" type="mds:ControlledTerm" minOccurs="0"/>
</complexType>

<complexType name="MediaReview">
  <element name="Reviewer" type="mds:Reviewer" minOccurs="1" maxOccurs="1"/>
  <element name="RatingValue" type="integer" minOccurs="1" maxOccurs="1"/>
  <element name="RatingCriterion" minOccurs="1" maxOccurs="1">
    <complexType>
      <element name="CriterionName" type="mds:TextualDescription"/>
      <element name="WorstRating" type="integer"/>
      <element name="BestRating" type="integer"/>
    </complexType>
  </element>
  <element name="FreeTextReview" type="mds:TextualDescription"
           minOccurs="0" maxOccurs="unbounded"/>
  <attribute name="id" type="ID"/>
</complexType>

Page 90: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

8.1.3.2 Semantics of MediaReview DS

Name Definition
MediaReview Media Review about the AV content.
id Identification of the instance of the MediaReview description.
Reviewer The reviewers/critics of the AV content. Being an extension of the Person DS, this field can describe an individual, a group of individuals, or an organization. It allows users to decide whether they will use this reviewer's Media Review in their content selection/filtering process. The reviewer can be an organization such as the MPAA (Motion Picture Association of America), or it can be an individual or a group of individuals, for example, Siskel & Ebert. It also allows multiple instances of Media Reviews about a single content item by multiple critics.
RatingValue The rating value assigned to the AV content from a rating criterion.
RatingCriterion The rating is a constrained value from a pre-defined rating scheme or rating criterion. The rating criterion includes the CriterionName and the rating scale (from WorstRating to BestRating).
FreeTextReview Unconstrained free textual review of the AV content with no reference to a rating scheme. There can be multiple instances in different languages.

8.1.3.3 Description Extraction

Manual instantiation.

8.1.3.4 Description Example

Example 1:

<MediaReview>
  <Reviewer>
    <Individual>
      <FamilyName>Ebert</FamilyName>
      <GivenName>Roger</GivenName>
    </Individual>
  </Reviewer>
  <Rating>
    <RatingCriterion>
      <CriterionName>Overall</CriterionName>
      <WorstRating>1</WorstRating>
      <BestRating>10</BestRating>
    </RatingCriterion>
    <RatingValue>10</RatingValue>
  </Rating>
  <FreeTextReview xml:lang="en">Excellent Drama</FreeTextReview>
  <FreeTextReview xml:lang="es">Excelente</FreeTextReview>
</MediaReview>

Example 2:

<MediaReview>
  <Reviewer>
    <Organization>
      <OrganizationName>Blockbuster</OrganizationName>
    </Organization>
  </Reviewer>
  <Rating>
    <RatingCriterion>
      <CriterionName>Number_of_Rentals_Nationwide</CriterionName>
      <WorstRating>20</WorstRating>
      <BestRating>1</BestRating>
    </RatingCriterion>
    <RatingValue>14</RatingValue>
  </Rating>
  <FreeTextReview xml:lang="en">
    Top 20 most rented video for the period March-April, 2000
  </FreeTextReview>
</MediaReview>

8.1.3.5 Description Use

The MediaReview DS includes third-party reviews of an AV content item, for example, critics' reviews of a movie. The Media Review DS can be used in conjunction with user preferences to enable the automatic selection, filtering or recording of AV content on a Personalized TV (PTV).

8.1.4 RelatedMaterial DS

The Related Material DS contains the description tools (Ds and DSs) related to additional information about the content available in other material.

8.1.4.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.4.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.4.3 Description Extraction

Manual instantiation.

8.1.4.4 Description Example

<RelatedMaterial>
  <Master>false</Master>
  <MediaType>Web</MediaType>
  <MediaLocator>
    <MediaURL>www.rtve.es</MediaURL>
  </MediaLocator>
</RelatedMaterial>

8.1.4.5 Description Use

String and controlled term search, browse and presentation.

8.1.5 Creation MetaInformation DS

The Creation Meta Information DS contains the description tools (Ds and DSs) that carry author-generated information about the generation/production process of an AV program or an image, information that cannot usually be extracted from the content itself (the actors creating the content can be extracted from the content). This information is related to the material but is not explicitly depicted in the actual content. The Creation Meta Information DS contains:

- Information about the creation not perceived in the material (e.g., the author of the script, the director, the character descriptions, the target audience, the rating, etc.) and information about the creation perceived in the material (e.g., the actors in the video, the players in a concert), and
- Classification related information (target audience, style, genre, rating, etc.).

Note that this DS may be attached to any Segment of the structure DS. Indeed, a complete visual description may contain segments that are annotated with more details, and these segments may be produced and used independently and/or in different AV materials and segments.

The description of actors within the Creation DS may be linked with the description of the content in terms of characters (to be defined within the Semantic DS?), using a Cast (Creator-Character Link) DS.


8.1.5.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.5.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

8.1.5.3 Description Extraction

Manual instantiation.

8.1.5.4 Description Example

<CreationMetaInformation>
  <Creation>
    <Title type="original">
      <TitleText xml:lang="es">Telediario (segunda edición)</TitleText>
      <TitleImage>
        <MediaURL>file://images/teledario_ori.jpg</MediaURL>
      </TitleImage>
    </Title>
    <Title type="alternative">
      <TitleText xml:lang="es">Noticias de la tarde</TitleText>
      <TitleImage>
        <MediaURL>file://images/teledario_alt.jpg</MediaURL>
      </TitleImage>
    </Title>
    <Title type="alternative">
      <TitleText xml:lang="en">Afternoon news</TitleText>
      <TitleImage>
        <MediaURL>file://images/teledario_en.jpg</MediaURL>
      </TitleImage>
    </Title>
    <Creator>
      <role>presenter</role>
      <Individual>
        <GivenName>Ana</GivenName>
        <FamilyName>Blanco</FamilyName>
      </Individual>
    </Creator>
    <CreationDate>1998-06-16</CreationDate>
    <CreationLocation>
      <PlaceName xml:lang="es">Piruli</PlaceName>
      <Country>es</Country>
      <AdministrativeUnit>Madrid</AdministrativeUnit>
    </CreationLocation>
  </Creation>
  <Classification>
    <CountryCode>es</CountryCode>
    <Language>
      <LanguageCode>es</LanguageCode>
      <CountryCode>es</CountryCode>
    </Language>
    <Genre>News</Genre>
    <PackagedType>Information</PackagedType>
    <Purpose>broadcasting</Purpose>
    <AgeClassification>all</AgeClassification>
  </Classification>
  <RelatedMaterial>
    <Master>false</Master>
    <MediaType>Web</MediaType>
    <MediaLocator>
      <MediaURL>www.rtve.es</MediaURL>
    </MediaLocator>
  </RelatedMaterial>
</CreationMetaInformation>

8.1.5.5 Description Use

String and controlled term search, browse and presentation.


9 Description of the content usage

9.1.1 Rights DS

The Rights DS contains the description tools (Ds and DSs) related to the right holders of the annotated content (IPR) and the Access Rights.

9.1.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.1.3 Description Extraction

Manual instantiation.

9.1.1.4 Description Example

The Rights Identifier needs further specification, but an example may look like the following.

<Rights>
  <RightsId IdOrganization="TVE" IdName="TVE_rights">
    tve:19980618:td2
  </RightsId>
</Rights>

9.1.1.5 Description Use

String search, browse and presentation.

9.1.2 UsageRecord DS

The UsageRecord DS contains the description tools (Ds and DSs) related to the use of the content (broadcasting, on-demand delivery, CD sales, etc.), that is, its life.

9.1.2.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.2.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.2.3 Description Extraction

Manual instantiation.

9.1.2.4 Description Example

<UsageRecord>
  <Type>Broadcast</Type>
  <Channel>TVE:ES</Channel>
  <Place><Country>es</Country></Place>
  <Date>1998-06-16T16:30+01:00</Date>
</UsageRecord>

Editor's Note: the time specification should be updated.

9.1.2.5 Description Use

String and controlled term search, browse and presentation.


9.1.3 Financial DS

The Financial DS contains information related to the costs generated and the income produced in the production and marketing of AV content.

9.1.3.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.3.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.3.3 Description Extraction

Manual instantiation.

9.1.3.4 Description Example

<Financial>
  <Cost>
    <CostType>production</CostType>
    <internationalPrice currency="EU" value="423.46"/>
  </Cost>
</Financial>

9.1.3.5 Description Use

Numerical (money amounts), string and controlled term search, browse and presentation. The notions of partial costs and incomes allow the different costs and incomes to be classified by their type. As the type is defined as a controlled term, it may be possible to have a set of standardized types of costs and incomes. Total and subtotal costs and incomes are to be calculated by the application from these partial values.

9.1.4 UsageMetaInformation DS

The Usage Meta Information DS contains the description tools (Ds and DSs) that carry information about the usage process of an AV program or an image. The Usage Meta Information DS contains:

- Information about the rights for using the material,
- Information about the ways and means of providing service over the material (e.g., edition, emission, etc.) and the results of the service provision (e.g., audience), and
- Financial information about the financial results of the production (in the Financial DS within the UsageMetaInformation DS) and of the publication (in the Financial DS within each Publication DS) of the material.

9.1.4.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.4.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

9.1.4.3 Description Extraction

Manual instantiation.

9.1.4.4 Description Example

<UsageMetaInformation>
  <Rights>
    <RightsId IdOrganization="TVE" IdName="TVE_rights">
      tve:19980618:td2
    </RightsId>
  </Rights>
  <UsageRecord>
    <Type>Broadcast</Type>
    <Channel>TVE:ES</Channel>
    <Place><Country>es</Country></Place>
    <Date>1998-06-16T16:30+01:00</Date>
  </UsageRecord>
  <Financial>
    <Cost>
      <CostType>production</CostType>
      <internationalPrice currency="EU" value="423.46"/>
    </Cost>
  </Financial>
</UsageMetaInformation>

Editor's Note: the time specification should be updated.

9.1.4.5 Description Use

Numerical (money amounts), string and controlled term search, browse and presentation. The information carried in descriptions of usage information is related to the material but is not explicitly linked to the actual content and, what is more important, many of its components may vary over time. That is, each time the AV content is used again (broadcast, published, delivered, etc.), a new UsageRecord DS instance will be incorporated. If the use implies some costs and revenues, a new Cost and/or Income DS instance will also be added to the Financial DS in the corresponding UsageRecord DS instance.


10 Description of the structural aspects of the content

The physical and logical aspects of AV content are described by the segment DSs, segment features, and the SegmentRelationshipGraph DS. The segment DSs may be used to form segment trees that define the structure of the AV content, i.e. a table of contents. The segment features describe features of segments. The SegmentRelationshipGraph DS is used to describe temporal, spatial, and spatio-temporal relationships, among others, between segments that are not described by the tree structures.

10.1 Segment

10.1.1 Segment DS

A segment represents a section of an AV content item. Its role is to define the common properties of the subclasses AudioSegment DS, StillRegion DS, MovingRegion DS and VideoSegment DS. Therefore, it may have both spatial and temporal properties. A temporal segment may be a set of samples in an audio sequence, represented by an AudioSegment DS, or a set of frames in a video sequence, represented by a VideoSegment DS. A spatial segment may be a region in an image or in a frame of a video sequence, represented by a StillRegion DS. Finally, a spatio-temporal segment may correspond to a moving region in a video sequence, represented by a MovingRegion DS. A specialized MovingRegion DS, the VideoText DS, represents a text region in a still image or in a set of video frames. The Segment DS is abstract and cannot be instantiated on its own; it is used to define the common properties of its subclasses.

10.1.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.1.3 Description Extraction

Since the Segment DS is abstract, it cannot be instantiated as such. Extraction is therefore irrelevant for this DS.

10.1.1.4 Description Example

Since the Segment DS is abstract, it cannot be instantiated as such. Examples are therefore irrelevant for this DS.

10.1.1.5 Description Use

Since the Segment DS is abstract, it cannot be instantiated as such. Usage is therefore irrelevant for this DS.

10.1.2 VideoSegment DS

A VideoSegment DS is a specific Segment DS. It inherits all the properties of the Segment DS (attributes, decomposition, descriptors and DSs). The VideoSegment DS describes a set of frames belonging to a video sequence. A single frame extracted from a video sequence is considered a video segment. The frames may or may not be contiguous in time; this is defined by the MediaTimeMask DS.

10.1.2.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.2.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.2.3 Description Extraction

The extraction of a video segment is the result of a temporal segmentation method. This may be done automatically or by hand, based on semantics or other criteria. An overview of temporal segmentation can be found in [1], [9] and [17]. In this section, we describe an automatic method for detecting scene changes [15] in MPEG-2 streams, which results in the temporal segmentation of video sequences into scenes.


This scene change detection algorithm is based on the statistical characteristics of the DCT DC values and motion vectors in the MPEG-2 bitstream. First, the algorithm detects suspected scene changes on I-, P- and B-frames separately, as well as dissolve editing effects. Then, a final decision is made to select the true scene changes. The overall scene change detection algorithm has five stages: minimal decoding, parsing, statistical, detection, and decision; these five stages are described below. Figure 4 shows the function blocks of each stage.

Figure 4: The Function Blocks of the Scene Change Detection Algorithm

Minimal Decoding Stage: The MPEG bitstream is decoded just enough to obtain the motion vectors and the DCT DC values.

Parsing Stage: The motion vectors in B- and P-frames are counted; the DCT DC values in P-frames are reconstructed.

Statistical Stage:
1. Compute R_p, the ratio of the number of intra-coded blocks to the number of forward motion vectors in P-frames.
2. Compute R_b, the ratio of the number of backward motion vectors to the number of forward motion vectors in B-frames.
3. Compute R_f, the ratio of the number of forward motion vectors to the number of backward motion vectors in B-frames.
4. Compute the variance of the DCT DC values of luminance in I- and P-frames.

Detection Stage:
1. Detect R_p peaks in P-frames and mark them as suspected scene change frames.
2. Detect R_b peaks in B-frames and mark them as suspected scene change frames.
3. Detect R_f peaks in B-frames. Detect all |Δσ²| peaks (the absolute value of the frame intensity variance difference) in I- and P-frames. Mark I-frames as suspected scene change frames if they have |Δσ²| peaks and if the immediate B-frames have R_f peaks.
4. Detect the parabolic variance curve for dissolve effects.

Decision Stage:
1. All suspected frames that fall in a dissolve region are unmarked.
2. Examine all marked frames, starting from the lowest frame number:
   if (current marked frame number - last scene change frame number) > T_rejection, then the current marked frame is a true scene change;
   else unmark the current frame
   (where T_rejection is the rejection threshold, by default one GOP).

The criterion in step 2 of the Decision Stage is used to eliminate situations where a scene change happens on a B-frame: the immediately subsequent P-frame and/or I-frame (in display order) would then likely be marked as a suspected scene change as well. Since these frames do not satisfy the criterion that the minimum distance between two scene changes has to be greater than T_rejection, they are unmarked. Situations where multiple scene changes occur within one GOP are very rare.
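A minimal, informative sketch (Python, not part of the standard) of this rejection rule, assuming suspected frames are given as frame numbers and T_rejection defaults to one GOP of 12 frames:

def decide_scene_changes(suspected_frames, t_rejection=12):
    """Keep a suspected frame only if it is more than t_rejection frames
    after the last accepted scene change."""
    accepted, last = [], -t_rejection - 1
    for frame in sorted(suspected_frames):
        if frame - last > t_rejection:
            accepted.append(frame)
            last = frame
        # else: the frame is unmarked (likely an echo of the same change)
    return accepted

# A change at B-frame 100 also flags the following P/I frames 102 and 108:
print(decide_scene_changes([100, 102, 108, 200]))  # -> [100, 200]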

For P-frames, the marked frame decision can be obtained both from step 1 (R_p peaks) and step 3 (|Δσ²| peaks) of the Detection Stage. The outcome of step 1 is usually more reliable; the outcome of step 3 can be used as a reference when the two conflict.

The following sections describe a technique for detecting the peak ratios R_b, R_p and R_f, and a method to detect dissolves.

10.1.2.3.1 Adaptive Local Window Threshold Setting Technique for Detecting Peak Ratios

Different scenes have very different motion vector ratios, but within the same scene they tend to be similar. Setting several levels of global thresholds would not only complicate the process but also cause false alarms and false dismissals. A local adaptive threshold technique is used to overcome this problem.


The peak ratios R_b, R_p and R_f are detected separately. All ratios are first clipped to a constant upper limit, usually 10 to 20. The ratio samples are then segmented into windows. Each window is 2 to 4 times the Group of Pictures (GOP) size. For typical movie sequences, scene change distances are mostly greater than 24 frames, so a window size of 2 GOPs is sufficient. If the GOP size is 12 and the window size is 2 GOPs, there will be 6 ratio samples for P-frames and 12 samples for B-frames. These samples are enough to detect the peaks.

Within each window, a histogram of the samples with a bin size of 256 is calculated. If the peak-to-average ratio is greater than the threshold T_d, the peak frame is declared a suspected scene change. The peak values are not included in calculating the average. A typical T_d is 3 for R_b, R_p and R_f. Figure 5b shows the histogram of a local window corresponding to P-frames (from frame 24 to 47) in Figure 5a, where there is a peak at frame 29. For B-frames, if a scene change happens at a B-frame (frame 10 in Figure 5a), the ratio of the immediately subsequent B-frame (frame 11) will also be high. Both are considered peaks and excluded from the average, but only the first B-frame is marked as a suspected scene change.

Figure 5: Motion Vector Ratio In B and P Frames
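A simplified, non-normative Python sketch of the peak-to-average test within one window follows (the 256-bin histogram step and the special handling of consecutive B-frame peaks are omitted; names are ours):

def find_window_peaks(samples, td=3.0, clip=15.0):
    """samples: (frame_number, ratio) pairs inside one local window."""
    clipped = [(f, min(r, clip)) for f, r in samples]  # clip to the upper limit
    peak_frame, peak_value = max(clipped, key=lambda p: p[1])
    rest = [r for f, r in clipped if f != peak_frame]  # exclude the peak value
    average = sum(rest) / len(rest) if rest else 0.0
    if average > 0 and peak_value / average > td:
        return [peak_frame]  # suspected scene change
    return []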

10.1.2.3.2 Detection of Dissolves

Dissolve is the most frequently used editing technique for connecting two scenes. A dissolve region is created by linearly mixing two scene sequences: one gradually decreasing in intensity, the other gradually increasing. Dissolves can be detected by taking the pixel-domain intensity variance of each frame and then detecting a parabolic curve. Given an MPEG bit-stream, the variance of the DCT DC coefficients in I and P frames is calculated instead of the spatial-domain variance. Experiments have shown that this approximation is accurate enough to detect dissolves.

10.1.2.3.2.1 Calculation of DCT DC Values

The I-frames in MPEG are all intra coded, so the DCT DC value can be obtained directly. P-frames consist of motion compensated (MC) macroblocks and intra coded macroblocks. An MC macroblock has a motion vector and a DCT coded MC error. Each macroblock consists of 4 luminance blocks (8x8 each) and some chrominance blocks (2, 6 or 8, depending on the chrominance format). To ensure maximum performance, only the DCT DC of the luminance blocks is used for dissolve detection. The DCT DC values of B pictures could also be reconstructed by applying the same technique.

To get the DCT DC values of a P frame, inverse motion compensation is applied on the luminance blocks and the DCT DC value of the error term is added to the reconstructed DCT DC. Assuming the variance within each block is small enough, b, the DCT DC of an MC block in a P frame, can be approximated by taking the area-weighted average of the four blocks in the previous frame pointed to by the motion vector:

b = [b0·(8-x)·(8-y) + b1·x·(8-y) + b2·(8-x)·y + b3·x·y] / 64 + b_error_DCT_DC

where x and y are the horizontal and vertical motion vector components modulo the block size 8; b0, b1, b2 and b3 are the DCT DC coefficients of the four neighboring blocks pointed to by the motion vector; and b_error_DCT_DC is the DCT DC of the motion compensation residual error of the block to be reconstructed (see Figure 6).

Figure 6: Inverse Motion Compensation of DCT DC
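As an illustration, a non-normative Python sketch of the area-weighted approximation above follows (function and parameter names are ours):

def reconstruct_dct_dc(b0, b1, b2, b3, mv_x, mv_y, b_error_dct_dc):
    """b0..b3: DCT DC values of the four reference blocks overlapped by the
    motion-compensated 8x8 block; mv_x, mv_y: motion vector components;
    b_error_dct_dc: DCT DC of the motion compensation residual."""
    x = mv_x % 8  # horizontal offset within the 8x8 block grid
    y = mv_y % 8  # vertical offset within the 8x8 block grid
    b = (b0 * (8 - x) * (8 - y)
         + b1 * x * (8 - y)
         + b2 * (8 - x) * y
         + b3 * x * y) / 64.0
    return b + b_error_dct_dc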


10.1.2.3.2.2 Dissolve Detection

Two criteria are used: first, the depth of the variance valley must be large enough; second, the duration of the suspected dissolve region must be long enough (otherwise it is more likely an abrupt scene change). A typical dissolve lasts from 30 to 60 frames. The specific procedure is as follows.

All positive peaks p+ are detected by applying the local window method to Δσ², the frame variance difference; the negative peaks p- are found as the minimum values between consecutive positive peaks; the peak-to-peak difference Δp = current peak - previous peak is calculated and thresholded using the proposed local window method; finally, potential matches whose duration (previous positive peak to current negative peak) is long enough (> Tf frames, e.g. 1/3 of the minimum allowed dissolve duration) are declared suspected dissolves.

The starting point of the suspected dissolve is the previous positive peak. If the next positive peak is at least Tf frames from the current negative peak, a dissolve is declared and the ending point is set to the next positive peak. Frames whose peak-to-peak distance meets the magnitude threshold but fails to meet the duration threshold are usually direct scene changes. Similarly, if the duration from the current negative peak to the next positive peak fails to meet Tf, the suspected dissolve is also unmarked.
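A non-normative Python sketch of the duration test follows (names are ours; the magnitude test on Δp is assumed to have already been passed):

def is_dissolve(prev_pos_peak, neg_peak, next_pos_peak, t_f):
    """Frame indices of the surrounding positive peaks and the negative peak;
    t_f: minimum duration threshold in frames."""
    long_enough_before = (neg_peak - prev_pos_peak) > t_f
    long_enough_after = (next_pos_peak - neg_peak) >= t_f
    return long_enough_before and long_enough_after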

10.1.2.4 Description Example

A tree of video segments may be used, for example, to create a Table of Contents. Used on its own, it may combine different Descriptors characterizing the same video segment. An example of a video segment tree is included below. VS1 could be a video program in which two scenes, VS2 and VS4, were detected with the method described in the previous section. The video segment VS1 is not continuous in time; it is composed of two temporal intervals of 6 and 3 minutes.

<VideoSegment id = "VS1" > <MediaTime> <MediaTimePoint> <h>0</h> <m>0</m> <s>0</s>T0:0:0 </MediaTimePoint> <MediaDuration> <m>PT10M</m> </MediaDuration> </MediaTime> <MediaTimeMask NumberOfIntervals = "2"> <MediaTime> <MediaTimePoint> <h>0</h> <m>0</m> <s>0</s>T0:0:0 </MediaTimePoint> <MediaDuration> <m>6</m>PT6M </MediaDuration> </MediaTime> <MediaTime> <MediaTimePoint> <h>0</h> <m>7</m> <s>0</s>T0:7:0 </MediaTimePoint> <MediaDuration> <m>3</m>PT3M </MediaDuration> </MediaTime> </MediaTimeMask>

  <GoFGoPHistogramD HistogramTypeInfo="Average">
    <!-- Value of GoFGoPHistogram D -->
  </GoFGoPHistogramD>
  <SegmentDecomposition Gap="true" Overlap="true" DecompositionType="temporal">
    <VideoSegment id="VS2">
      <MediaTime>
        <MediaTimePoint>T0:0:0</MediaTimePoint>
        <MediaDuration>PT5M</MediaDuration>
      </MediaTime>
      <GoFGoPHistogramD HistogramTypeInfo="Average"> </GoFGoPHistogramD>
    </VideoSegment>
    <VideoSegment id="VS4">
      <MediaTime>
        <MediaTimePoint>T0:0:0</MediaTimePoint>
        <MediaDuration>PT6M</MediaDuration>
      </MediaTime>
      <GoFGoPHistogramD HistogramTypeInfo="Average">


      </GoFGoPHistogramD>
    </VideoSegment>
  </SegmentDecomposition>
</VideoSegment>

Editor's Note: The MediaTime description should be updated.

10.1.2.5 Description Use

Video segment descriptions can be used for retrieval, browsing, and visualization applications.

10.1.3 StillRegion DS

A StillRegion DS is a specific Segment DS. It inherits all the properties of the Segment DS (attributes, decomposition, descriptors and DSs). The StillRegion DS describes a spatial area (e.g. a set of pixels) belonging to a still image or a single frame of a video sequence. The still images can be natural or synthetic. A still image is a particular case of the StillRegion DS. The spatial area may be connected or not; this is defined by the SpatialConnectivity attribute.

10.1.3.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.3.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.3.3 Description Extraction

The extraction of a still region is the result of a spatial segmentation or a spatial decomposition. This may be done automatically or by hand, based on semantic criteria or not. Several techniques are reported in [1], [12]. Note that this kind of analysis tool has been extensively studied in the framework of the MPEG-4 standard; some examples are described in [11], [12], [17]. In this section, we describe one example proposed in [17].

The extraction of individual spatial or temporal regions and their organization within a tree structure can be viewed as a hierarchical segmentation problem, also known as partition tree creation. The strategy presented here involves three steps, illustrated in Figure 7. The first step is a conventional segmentation; its goal is to produce an initial partition with a rather high level of detail. Depending on the nature of the segment tree, this initial segmentation can be a shot detection algorithm (for VideoSegments) or a spatial segmentation following a color homogeneity criterion (for Regions). The second step is the creation of a Binary Partition Tree: combining the segments (VideoSegments or regions) created in the first step, the Binary Partition Tree defines the set of segments to be indexed and encodes their similarity with the help of a binary tree. Finally, the third step restructures the binary tree into an arbitrary tree. Although the approach can be used for many types of segments, the following description assumes that we are dealing with still regions.

Figure 7: Outline of segment tree creation

The first step is rather classical (see [1] and the references therein, for example). The second step of the segment tree creation is the computation of a Binary Partition Tree. The nodes of the tree represent connected components in space (regions). The leaves of the tree represent the regions defined by the initial segmentation step. The remaining nodes represent segments that are obtained by merging the segments represented by their children. This representation should be considered a compromise between representation accuracy and processing efficiency. Indeed, not all possible mergings of initial segments are represented in the tree; only the most "useful" merging steps are represented. However, the main advantage of the binary tree representation is that efficient techniques are available for its creation, and it conveys enough information about segment similarity to construct the final tree. The Binary Partition Tree should be created in such a way that the most "useful" segments are represented. This issue can be application dependent. However, a possible solution, suitable for a large number of cases, is to create the tree by keeping track of the merging steps performed by a segmentation algorithm based on merging. This information is called the merging sequence. The process is illustrated in Figure 8. The original partition involves four regions. The regions are indicated by a letter and the number indicates the mean gray level value. The algorithm merges the four regions in three steps.


Figure 8: Example of Binary Partition Tree creation with a region merging algorithm
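A non-normative Python sketch of this merging-sequence bookkeeping follows; the data layout (integer region ids with mean gray levels and an adjacency set) and the unweighted similarity model are our simplifying assumptions:

import itertools

def build_binary_partition_tree(mean_gray, adjacency):
    """mean_gray: dict mapping integer region id -> mean gray level;
    adjacency: set of frozenset({a, b}) pairs of adjacent region ids.
    Returns the merging sequence as a list of (parent, child_a, child_b)."""
    mean_gray = dict(mean_gray)
    adjacency = set(adjacency)
    fresh = itertools.count(max(mean_gray) + 1)  # ids for merged nodes
    merging_sequence = []
    while adjacency:
        # Merge the most similar pair of neighboring regions first.
        pair = min(adjacency, key=lambda p: abs(mean_gray[min(p)] - mean_gray[max(p)]))
        a, b = sorted(pair)
        parent = next(fresh)
        # Unweighted average as a stand-in homogeneity model (areas omitted).
        mean_gray[parent] = (mean_gray[a] + mean_gray[b]) / 2.0
        merging_sequence.append((parent, a, b))
        # Rewire adjacency: neighbors of a or b become neighbors of the parent.
        new_adjacency = set()
        for p in adjacency:
            if p == pair:
                continue
            p = frozenset(parent if r in pair else r for r in p)
            if len(p) == 2:
                new_adjacency.add(p)
        adjacency = new_adjacency
        del mean_gray[a], mean_gray[b]
    return merging_sequence

For the four-region example of Figure 8, the returned merging sequence would contain three entries, one per merging step.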

To create the Binary Partition Tree, the merging algorithm may use several homogeneity criteria based on low-level features. For example, if the image belongs to a sequence of images, motion information can be used to generate the tree: in a first stage, regions are merged using a color homogeneity criterion, whereas a motion homogeneity criterion is used in the second stage. Figure 9 presents an example of a Binary Partition Tree created with color and motion criteria on the Foreman sequence. The nodes appearing in the lower part of the tree as white circles correspond to the color criterion, whereas the dark squares correspond to the motion criterion. As can be seen, the process starts with the color criterion and then, when a given Peak Signal to Noise Ratio (PSNR) is reached, changes to the motion criterion. Regions that are homogeneous in motion, such as the face and helmet, are represented by a single node (B) in the tree.

Figure 9: Examples of creation of Binary Partition Tree with color and motion homogeneity criteria

Furthermore, additional information about previous processing or detection algorithms can also be used to generate the tree in a more robust way. For instance, an object mask can be used to impose constraints on the merging algorithm in such a way that the object itself is represented as a single node in the tree. Typical examples of such algorithms are face, skin, character or foreground object detection. An example is illustrated in Figure 10. Assume for example that the original Children image sequence has been analyzed so that masks of the two foreground objects are available. If the merging algorithm is constrained to merge regions within each mask before dealing with remaining regions, the region of support of each mask will be represented as a single node in the resulting tree. In Figure 10, the nodes corresponding to the background and the two foreground objects are represented by squares. The three sub-trees further decompose each object into elementary regions.

Figure 10: Example of partition tree creation with restriction imposed with object masks


The purpose of the third and last step of the algorithm is to restructure the Binary Partition Tree into an arbitrary tree that reflects the image structure more clearly. To this end, nodes that have been created by the binary merging process but do not convey any relevant information should be removed. The criterion used to decide whether a node must appear in the final tree can be based on the variation of segment homogeneity: a segment is kept in the final tree if the homogeneity variation between itself and its parent is low. Furthermore, tree-pruning techniques can be used, since not all regions may have to be described. Finally, a specific GUI can also be designed to modify the tree manually in order to keep the useful sets of segments. An example of a simplified and restructured tree is shown in Figure 11.

Figure 11: Example of restructured tree

10.1.3.4 Description Example

A tree of still regions may be used, for example, to create a Table of Contents. Used on its own, it may combine different Descriptors characterizing the same still region. An example of a still region tree is included below. SR1 could represent an image in which two objects, SR2 and SR3, have been segmented.

<StillRegion id = "SR1" SpatialConnectivity = "true"> <ContourShapeD PeakCount = "0" HighestPeak = "0"> <GlobalCurvatureVector> <gcv1>2511</gcv1><gcv2>8232</gcv2> </GlobalCurvatureVector> </ContourShapeD> <SegmentDecomposition Gap = "true" Overlap = "false" DecompositionType = "spatial"> <StillRegion id = "SR2" SpatialConnectivity = "true"> <ContourShapeD> </ContourShapeD> </StillRegion> <StillRegion id = "SR3" SpatialConnectivity = "true"> <ContourShapeD> </ContourShapeD> </StillRegion> </SegmentDecomposition></StillRegion>

10.1.3.5 Description Use

Still region descriptions can be used for retrieval, browsing, and visualization applications.

10.1.4 MovingRegion DS

A MovingRegion DS is a specific Segment DS. It inherits all the properties of the Segment DS (attributes, decomposition, descriptors and DSs). The MovingRegion DS describes a spatio-temporal area belonging to a video sequence. The spatio-temporal area may be connected in several ways; this is defined by the SpatialConnectivity attribute in space and by the MediaTimeMask DS in time.

10.1.4.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.4.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


10.1.4.3 Description Extraction

The extraction of a moving region is the result of a spatio-temporal segmentation. This may be done automatically or by hand, based on semantics or other criteria. An overview of spatio-temporal segmentation can be found in [1]. In this section, we describe the semi-automatic method used in the AMOS system [22] to segment and track semantic objects in video sequences.

AMOS is a system that combines low-level automatic region segmentation with an active method for defining and tracking high-level semantic video objects. A semantic object is represented as a set of underlying homogeneous regions. The system contains two stages (see Figure 12): an initial object segmentation stage, where user input in the starting frame is used to create a semantic object; and an object tracking stage, where the underlying regions of the semantic object are tracked and grouped through successive frames.

Figure 12: General Structure of AMOS

10.1.4.3.1 Initial Semantic Object Segmentation

Semantic object segmentation at the starting frame consists of several major processes, as shown in Figure 13. First, users identify a semantic object by using tracing interfaces (e.g. a mouse). The input is a polygon whose vertices and edges lie roughly along the desired object boundary. To tolerate user-input error, a snake algorithm [10] can be used to align the user-specified polygon to the actual object boundary. The snake algorithm is based on minimizing a specific energy function associated with edge pixels. Users may also choose to skip the snake module if a relatively accurate outline is already provided.

After the object definition, users can start the tracking process by specifying a set of thresholds. These thresholds include a color merging threshold, weights on the three color channels (i.e. L*u*v*), a motion merging threshold and a tracking buffer size (see the following sections for their usage). These thresholds can be chosen based on the characteristics of a given video shot and on experimental results. For example, for a video shot where foreground objects have a luminance similar to that of background regions, users may put a lower weight on the luminance channel. Users can start the tracking process for a few frames with the default thresholds, which are automatically generated by the system, and then adjust the thresholds based on the segmentation and tracking results. The system also allows a user to stop the tracking process at any frame, modify the object boundary being tracked, and then restart the tracking process from the modified frame.

Figure 13: Object segmentation at starting frame

Given the initial object boundary from users (or the snake module), a slightly extended (~15 pixels) bounding box surrounding the arbitrarily shaped object is computed. Within the bounding box, three feature maps, an edge map, a color map, and a motion field, are created from the original images. The color map is the major feature map in the following


segmentation module. It is generated by first converting the original image into the CIE L*u*v* color space and then quantizing the pixels to a limited number of colors (e.g. 32 or 16 bins) using a clustering-based (e.g. K-Means) method. The edge map is a binary mask where edge pixels are set to 1 and non-edge pixels are set to 0; it is generated by applying the Canny edge detection algorithm. The motion field is generated by a hierarchical block matching algorithm [1]; a 3-level hierarchy is used, as suggested in [1].

The intra-frame segmentation module is based on an automatic region segmentation algorithm using color and edge information [23],[24]. As stated in [23],[24], color-based region segmentation can be greatly improved by fusing it with edge information: the color-based region merging process works well on quantized and smoothed images, while edge detection captures the high-frequency details in an image. In AMOS, to further improve accuracy, a motion-based segmentation process using optical flow is applied to the segmented color regions to check the uniformity of the motion distribution. Although the complete process utilizing color, edge, and motion is not trivial, the computational complexity is greatly reduced by applying the above region segmentation only inside the bounding box of the snake object instead of on the whole frame.

The region aggregation module takes the homogeneous regions from the segmentation and the initial object boundary from the snake (or directly from user input). Aggregation at the starting frame is relatively simple compared with that for subsequent frames, as all regions are newly generated (not tracked) and the initial outline is usually not far from the real object boundary. A region is classified as foreground if more than a certain percentage (e.g. 90%) of the region is included in the initial object. On the other hand, if less than a certain percentage (e.g. 30%) of a region is covered, it is considered background. Regions between the low and high thresholds are split into foreground and background regions according to their intersection with the initial object mask.
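A non-normative Python sketch of this classification rule follows (the thresholds are the example values from the text; the names are ours):

def classify_region(coverage, high=0.9, low=0.3):
    """coverage: fraction of the region's pixels inside the initial object."""
    if coverage > high:
        return "foreground"
    if coverage < low:
        return "background"
    return "split"  # intersect with the object mask and split into both classes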

Finally, affine motion parameters of all regions, including both foreground and background, are estimated by a multivariate linear regression over the dense optical flow inside each region. In this system, a 2-D affine model with 6 parameters is used. These affine models are used to help track the regions and the object in future frames, as discussed in the next section.
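As an illustration, a non-normative least-squares sketch of the 6-parameter affine fit over the dense flow follows (the formulation and names are ours):

import numpy as np

def fit_affine_motion(xs, ys, us, vs):
    """xs, ys: pixel coordinates inside one region; us, vs: optical flow
    components at those pixels. Fits (u, v) = (a1 + a2*x + a3*y,
    a4 + a5*x + a6*y) and returns the parameter vector (a1..a6)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    A = np.column_stack([np.ones_like(xs), xs, ys])  # design matrix
    a123, _, _, _ = np.linalg.lstsq(A, np.asarray(us, dtype=float), rcond=None)
    a456, _, _, _ = np.linalg.lstsq(A, np.asarray(vs, dtype=float), rcond=None)
    return np.concatenate([a123, a456])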

10.1.4.3.2 Semantic Object Tracking

Given the object with homogeneous regions constructed at the starting frame, tracking in successive frames is achieved by motion projection and an inter-frame segmentation process. The main objectives of the tracking process are to avoid losing foreground regions and to avoid including false background regions. It contains the following steps (see Figure 14).

Figure 14: Automatic semantic object tracking

First, the segmented regions from the previous frame, including both foreground and background regions, are projected onto the current frame (virtually) using their individual affine motion models. Projected regions keep their labels and original classifications. For video shots with a static or homogeneous background (i.e. only one moving object), users can choose not to project background regions in order to save time.

Generation of the three feature maps (color, edge and motion) uses the same methods as described in the previous section. The only difference is that, in the quantization step, the existing color palette computed at the starting frame is used directly to quantize the current frame. Using a consistent quantization palette enhances the color consistency of segmented regions between successive frames, and thus improves the performance of region-based tracking. As object tracking is limited to single video shots, in which there is no abrupt scene change, using one color palette is generally


valid. Certainly, a new quantization palette can be generated automatically when a large quantization error is encountered.

In the tracking module (i.e. inter-frame segmentation), regions are classified into foreground, background and new regions. Foreground or background regions tracked from the previous frame are allowed to merge with regions of the same class, but merging across different classes is forbidden. New regions can be merged with each other or with foreground/background regions; when a new region is merged with a tracked region, the merging result inherits its label and classification from the tracked region. In motion segmentation, split regions remain in their original classes. After this inter-frame tracking process, we obtain a list of regions temporarily tagged as foreground, background, or new. They are then passed to an iterative region aggregation process.

The region aggregation module takes two inputs: the homogeneous regions and the estimated object boundary. The object boundary is estimated from the projected foreground regions: foreground regions from the previous frame are projected independently, and the combination of the projected regions forms the mask of the estimated object. The mask is refined with a morphological closing operation (i.e. dilation followed by erosion) with a size of several pixels, in order to close tiny holes and smooth boundaries. To tolerate motion estimation errors that may cause the loss of foreground regions around the object boundary, the mask is further dilated by the tracking buffer size, which is specified by users at the beginning of the tracking.

The region aggregation module implements a region grouping and boundary alignment algorithm based on the estimated object boundary as well as the edge and motion features of the regions. Background regions are first excluded from the semantic object. For every foreground or new region, the intersection ratio of the region with the object mask is computed. Then, if:

1) The region is foreground: if it is covered by the object mask by more than 80%, it belongs to the semantic object. Otherwise, the region is intersected with the object mask and split:

a) split regions inside the object mask are kept as foreground
b) split regions outside the object mask are tagged as new

2) The region is new: if it is covered by the object mask by less than 30%, keep it as new; else if it is covered by the object mask by more than 80%, classify it as foreground. Otherwise:

a) Compute the numbers of edge pixels (using the edge map) between this region and the current background and foreground regions. Compute the differences between the mean motion vector of this region and those of its neighboring regions, and find the neighbor with the most similar motion.

b) If the region is separated from background regions by more edge pixels than from foreground regions (or if this region is not connected to any background regions), and its closest motion neighbor is a foreground region, intersect it with the object mask and split:

- split regions inside the object mask are classified as foreground
- split regions outside the object mask are tagged as new

c) Otherwise, keep the region as new.

Compared with the aggregation process in the previous section, a relatively lower ratio (80%) is used to include a foreground or new region; this handles motion projection errors. As it is possible for multiple layers of new regions to emerge between the foreground and the background, the above aggregation and boundary alignment process is iterated multiple times. This step is useful in correcting errors caused by rapid motion. At the end of the last iteration, all remaining new regions are classified as background regions. Finally, the affine models of all regions, including both foreground and background, are estimated. As described before, these affine models are used to project regions onto the future frame in the motion projection module.
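A non-normative Python sketch of one pass of these aggregation rules follows (the 80% and 30% ratios are from the text; the predicate names are ours):

def aggregate_region(region_class, coverage, more_fg_edges, closest_is_fg):
    """region_class: 'foreground' or 'new'; coverage: fraction of the region
    inside the estimated object mask; more_fg_edges: True if the region is
    separated from background by more edge pixels than from foreground (or
    touches no background region); closest_is_fg: True if the most similar
    motion neighbor is a foreground region."""
    if region_class == "foreground":
        return "foreground" if coverage > 0.8 else "split"
    # region_class == "new"
    if coverage < 0.3:
        return "new"
    if coverage > 0.8:
        return "foreground"
    if more_fg_edges and closest_is_fg:
        return "split"  # intersect with the mask and split
    return "new"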

10.1.4.4 Description Example

A tree of moving regions may be used, for example, to create a Table of Contents. Used on its own, it may combine different Descriptors characterizing the same moving region. An example of a video segment decomposed into two levels of moving regions is included below. VS4 could be a video scene in which two semantic objects, MR5 and MR8, have been segmented and tracked using the system described in the previous section. MR6-7 and MR9-10 could correspond to homogeneous regions of the semantic objects MR5 and MR8, respectively.

<VideoSegment id = "VS4" > <MediaTime> <MediaTimePoint> <h>0</h> <m>0</m> <s>0</s>T0:0:0 </MediaTimePoint> <MediaDuration> PT10M<m>10</m> </MediaDuration>


  </MediaTime>

  <GoFGoPHistogramD> </GoFGoPHistogramD>
  <SegmentDecomposition Gap="true" Overlap="true" DecompositionType="spatio-temporal">
    <MovingRegion id="MR5" SpatialConnectivity="false">
      <MediaTime>
        <MediaTimePoint>T0:0:0</MediaTimePoint>
        <MediaDuration>PT10M</MediaDuration>
      </MediaTime>
      <MediaTimeMask NumberOfIntervals="2">
        <MediaTime>
          <MediaTimePoint>T0:0:0</MediaTimePoint>
          <MediaDuration>PT6M</MediaDuration>
        </MediaTime>
        <MediaTime>
          <MediaTimePoint>T0:7:0</MediaTimePoint>
          <MediaDuration>PT3M</MediaDuration>
        </MediaTime>
      </MediaTimeMask>

      <ParametricObjectMotionD> </ParametricObjectMotionD>
      <SegmentDecomposition Gap="true" Overlap="true" DecompositionType="spatio-temporal">
        <MovingRegion id="MR6" SpatialConnectivity="false"> </MovingRegion>
        <MovingRegion id="MR7" SpatialConnectivity="true"> </MovingRegion>
      </SegmentDecomposition>
    </MovingRegion>
    <MovingRegion id="MR8" SpatialConnectivity="true">
      <MediaTime>
        <MediaTimePoint>T0:0:0</MediaTimePoint>
        <MediaDuration>PT6M</MediaDuration>
      </MediaTime>
      <ParametricObjectMotionD> </ParametricObjectMotionD>
      <SegmentDecomposition Gap="true" Overlap="true" DecompositionType="spatio-temporal">
        <MovingRegion id="MR9" SpatialConnectivity="false"> </MovingRegion>
        <MovingRegion id="MR10" SpatialConnectivity="true"> </MovingRegion>
      </SegmentDecomposition>
    </MovingRegion>
  </SegmentDecomposition>
</VideoSegment>

Editor's Note: The MediaTime description should be updated.

10.1.4.5 Description Use

Moving region descriptions can be used for retrieval, browsing, and visualization applications. This section describes a query model for similarity searching of video objects based on localized visual features of the objects and the objects' regions, and on spatio-temporal relationships among the objects' regions [25]. In moving region descriptions, the video objects correspond to the first level of moving regions (MR5 in 10.1.4.4); the objects' regions to the second level of moving regions (MR6 and MR7 in 10.1.4.4); the visual features to visual descriptors; and the spatio-temporal relationships to segment relationships in segment relationship graphs.


Let us consider that each video object has a unique ObjectID and each region a unique RegionID. Several visual features (e.g. the GoF Color Histogram) are associated with each ObjectID and RegionID. In addition, each ObjectID is associated with spatio-temporal feature vectors as described in 10.1.6.3. These IDs are used as indexes in the description of the object matching and retrieval processes.

Given a query object with N regions, the searching approach in [25] consists of two stages:

(1) Region Search: find a candidate region list for each query region based on visual features and spatio-temporal relationships.
(2) Join & Validation: join these candidate region lists to produce the best matched video objects by combining visual and structure similarity metrics, and compute the final global distance measure.

The video object query model is shown in Figure 15.

Figure 15: The video object query model

The detailed procedure follows:

1) For every query region, find a candidate region list based on the weighted sum (according to the weights given by users) of the distance measures of different visual features (e.g., shape or trajectory). All individual feature distances are normalized to [0,1]. Only regions with distances smaller than a threshold are added to a candidate list. The threshold is a pre-set value used to empirically control the number or percentage of objects the query system will return. For example, a threshold of 0.3 indicates that users want to retrieve around 30 percent of the video objects in the database (assuming the feature vectors of the video objects in the database are normally distributed in the feature spaces). The threshold can be set to a large value to ensure completeness, or a small value to improve speed.

2) Sort the regions in each candidate region list by their ObjectIDs.

3) Perform a join (outer join) of the region lists on ObjectID to create a candidate object list. Each candidate object, in turn, contains a list of regions. A "NULL" region is used when:
   - a region list does not contain a region with the ObjectID of the object being joined
   - a region appears (i.e. is matched) more than once in the object being joined

4) Compute the distance between the query object and each object in the candidate object list as follows:

D = w0 FD(q_i, r_i) + w1 SD(sog_q, sog_o) + w2 SD(topo_q, topo_o) + w3 SD(temp_q, temp_o)

where q_i is the ith query region and r_i is the ith region in a candidate object. FD(.) is the feature distance between a region and its corresponding query region; if r_i is NULL, the maximum distance (i.e., 1) is assigned. sog_q (spatial orientation), topo_q (topological relationship) and temp_q (temporal relationship) are structure features of the query object; sog_o, topo_o, and temp_o are retrieved from the database based on ObjectID, RegionID and temporal positions. When there is a NULL region (due to the above join process), the corresponding dimension of the retrieved feature vector has a NULL value. SD(.) is the L1-distance, and a penalty of maximum difference is assigned to any dimension with a NULL value.

5) Sort the candidate object list according to the above distance measure D and return the result.
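A non-normative Python sketch of the join-and-score procedure (steps 3 to 5) follows; the data layout and the precomputed structural term are our simplifying assumptions:

def score_candidates(candidate_lists, structure_distance, w0=1.0):
    """candidate_lists: one dict per query region, mapping ObjectID -> feature
    distance FD in [0, 1]; structure_distance: ObjectID -> precomputed
    structural term w1*SD(sog) + w2*SD(topo) + w3*SD(temp)."""
    object_ids = set().union(*candidate_lists)   # outer join on ObjectID
    scored = []
    for oid in object_ids:
        # NULL regions (object absent from a list) get the maximum distance 1.
        feature_term = sum(regions.get(oid, 1.0) for regions in candidate_lists)
        scored.append((w0 * feature_term + structure_distance.get(oid, 0.0), oid))
    return sorted(scored)   # step 5: rank candidate objects by distance D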

10.1.5 VideoText DS

A VideoText DS is a specific MovingRegion DS. It inherits all the properties of the MovingRegion DS (attributes, decomposition, descriptors and DSs). The VideoText DS describes a text region in a still image or a set of video frames. The text region is described by the superimposed characters; by syntactic attributes of the text such as language, font size and font style; and by other information such as its time, color, motion, and spatial location, through its derivation from the MovingRegion DS.


10.1.5.1 Description Scheme Syntax

<!-- ############################################### -->
<!-- Definition of "VideoText DS"                    -->
<!-- ############################################### -->

<simpleType name="TextDataType" base="string">
  <enumeration value="Superimposed"/>
  <enumeration value="SceneEmbedded"/>
</simpleType>

<complexType name="VideoText" base="mds:MovingRegion" derivedBy="extension">
  <element name="Text" type="mds:TextualDescription"
           minOccurs="0" maxOccurs="1"/>
  <attribute name="TextType" type="mds:TextDataType" use="optional"/>
  <attribute name="FontSize" type="positiveInteger" use="optional"/>
  <attribute name="FontStyle" type="string" use="optional"/>
</complexType>

10.1.5.2 Description Scheme Semantics

The VideoText DS inherits all the properties of the MovingRegion DS. In particular, it includes all the attributes, Ds and DSs defined in the MovingRegion DS with the same syntax and semantics.

Name: Definition

VideoText: Text region in an image or a set of video frames.

Text: Textual description that contains the text string recognized in the video text region. The textual description includes an attribute to specify the language of the text string.

TextDataType: Datatype defining the kind of video text. The possible kinds of video text are superimposed text and scene-embedded text. Scene-embedded video text appears as part of, and is recorded with, the scene (e.g. street and shop names or text on people's clothing). Superimposed video text, on the other hand, is intended to carry and stress important information and is typically generated by video title machines or graphical font generators in studios.

TextType: Attribute that specifies the text type of the video text.

FontSize: Integer that specifies the font size of the video text.

FontStyle: String that specifies the font style of the video text.

Editor's Note: Definitions are needed for superimposed and scene-embedded as the types of video text.

10.1.5.3 Description Extraction

Extraction of videotext in a frame is the result of image analysis involving text character segmentation and location. This may be done automatically or by hand, based on semantics or other criteria. In this section, we discuss how videotext can be extracted automatically from digital videos. Text can appear anywhere in a frame and in different contexts. The algorithms presented here are designed to extract superimposed text, as well as scene text that possesses typical (superimposed) text attributes. No prior knowledge about frame resolution, text location, font styles, or text appearance modes such as normal and inverse video is assumed. Some common characteristics of text are exploited in the algorithms, including the monochromaticity of individual characters, size restrictions (characters cannot be too small to be read by humans, nor so big that they occupy a large portion of the frame), and the horizontal alignment of text (preferred for ease of reading).

Approaches to extracting text from videos can be broadly classified into three categories: (i) methods that use region analysis, (ii) methods that perform edge analysis, and (iii) methods that use texture. The following section describes a region-based algorithm from IBM [18], which is followed by a section containing an edge-based algorithm from Philips [2].

10.1.5.3.1 Videotext Extraction Using Region Analysis


The IBM algorithm [18] for videotext extraction works by extracting and analyzing regions in a video frame. The goals of this system are (i) isolating regions that may contain text characters, (ii) separating each character region from its surroundings and (iii) verifying the presence of text by consistency analysis across multiple text blocks.

10.1.5.3.1.1 Candidate Text Region Extraction

The first step in the IBM system is to remove non-text background from an input gray scale image, generated by scanning a paper document, downloading a Web image, or decompressing an encoded (for example, MPEG-1 or MPEG-2) video stream. The generalized region labeling (GRL) algorithm [19] is used to extract homogeneous regions from this image. The GRL algorithm labels pixels in an image based on a given criterion (e.g., gray scale homogeneity) using contour traversal, thus partitioning the image into multiple regions; it then groups the pixels belonging to a region by determining its interior and boundaries, and extracts region features such as its MBR (minimum bounding rectangle), area, etc. The criterion used to group pixels into regions is that the gray level difference between any pair of pixels within the region cannot exceed ±10.

The GRL algorithm thus segments the image into nonoverlapping homogeneous regions. It also yields complete region information such as the label, outer and inner boundaries, number of holes within the region, area, average gray level, gray level variance, centroid and MBR. Next, non-text background regions among the detected regions are removed based on their size: a region is removed if the width and height of its MBR are greater than 24 and 32 pixels, respectively (these values can be adaptively modified depending on the image size). By employing a spatial proportion constraint rather than an area constraint, large homogeneous regions which are unlikely to be text are removed. The remaining candidate regions may be fragmented into multiple regions because of varying contrast in their surroundings. To group multiple touching regions into a single coherent region, a binary image is generated from the labeled region image, in which all regions that do not satisfy the size constraint are marked "0" and the remaining regions are marked "1". This binary image is processed using the GRL algorithm to obtain new connected regions. With the creation of a binary image followed by a relabeling step, many small connected fragments of a candidate text region are merged together.

10.1.5.3.1.2 Text Region Refinement

Here the basic idea is to apply appropriate criteria to extract character segments within the candidate regions. Within a region, characters with holes can be embedded in a complex background, and since OCR systems require text to be printed against a clean background for processing, the second stage attempts to remove the background within the regions while preserving the candidate character outline. Since character outlines in these regions can be degraded and merged with the background, an iterative local thresholding operation is performed in each candidate region to separate the region from its surroundings and from other extraneous background contained within its interior. Once thresholds are determined automatically for all candidate regions, positive and negative images are computed, where the positive image contains region pixels whose gray levels are above their respective local thresholds and the negative image contains region pixels whose gray levels fall below their respective thresholds. Observe that the negative image will contain candidate text regions if the text appears in inverse video mode. All the remaining processing steps are performed on both positive and negative images and their results are combined. Thus the IBM system can handle both normal and inverse video appearances of text.

We further sharpen and separate the character region boundaries by performing a region boundary analysis. This is necessary especially when characters within a text string appear connected with each other and need to be separated for accurate text identification. This is achieved by examining the gray level contrast between the character region boundaries and the regions themselves. For each candidate region R, a threshold T can be computed, for instance, as the midpoint between the mean gray level of the circumscribing boundary of the region and the mean gray level of the region itself:

T = [ (Σk Icbk) / Ncb + (Σl Iil) / Ni ] / 2

where Icbk is the gray level of pixel k on the circumscribing boundaries of the region, Iil is the gray level of pixel l belonging to R (including the interior and the region boundary), Ncb is the number of pixels on the circumscribing boundaries of the region, and Ni is the number of pixels in the region. A pixel is defined to be on the circumscribing boundary of a region if it does not belong to the region but at least one of its four neighbors (using 4-connectivity) does. Those pixels in R whose gray level is less than T are marked as belonging to the background and discarded, while the others are retained in the region. Note that this condition is reversed for the negative image. This step is repeated until the value of T does not change over two consecutive iterations.
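A non-normative Python sketch of this iterative update, under the midpoint assumption above, follows (the implementation and names are ours):

import numpy as np

def iterative_local_threshold(region_pixels, boundary_pixels, max_iter=100):
    """region_pixels: gray levels of pixels inside the candidate region;
    boundary_pixels: gray levels on its circumscribing boundary."""
    region = np.asarray(region_pixels, dtype=float)
    boundary_mean = float(np.mean(boundary_pixels))
    t = (boundary_mean + region.mean()) / 2.0
    for _ in range(max_iter):
        kept = region[region >= t]           # pixels retained as character
        if kept.size == 0:
            break
        t_new = (boundary_mean + kept.mean()) / 2.0
        if t_new == t:                       # converged: T no longer changes
            break
        t, region = t_new, kept              # discard background pixels
    return t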

10.1.5.3.1.3 Text Characteristics Verification

The few remaining candidate character regions are now tested for typical text font characteristics. A candidate region is removed if its area is less than 12 pixels or its height is less than 4 pixels, because such small fonts are difficult for


OCR systems to recognize. It is also removed if the ratio of the area of its MBR to the region area (the fill factor) is greater than 4. Finally, it may be removed if its gray level contrast with the background is low, i.e., if the difference between the mean gray level of the circumscribing boundary pixels (Icbk, for pixel k on the circumscribing boundaries of the region) and the mean gray level of the region boundary pixels (Ibl, for pixel l on the boundaries of the region) falls below a fixed contrast threshold. Since region boundary information is readily available from the GRL algorithm, this boundary-based test can easily be performed to remove noisy non-text regions. Note also that the parameters used were determined from a study of a large number of SIF-resolution videos and were kept fixed during the experimentation.

10.1.5.3.1.4 Text Consistency Analysis

Consistency between neighboring text regions is verified to eliminate false positive regions. The system attempts to ensure that adjacent regions in a line exhibit the characteristics of a text string, thus locally verifying the global structure of the line. This text consistency test includes (i) position analysis, which checks inter-region spacing: the distance between the centroids of the MBRs of a pair of retained neighboring regions must be less than 50 pixels; (ii) horizontal alignment analysis of regions: the vertical centers of neighboring MBRs must be within 6 pixels of one another; and (iii) vertical proportion analysis of adjacent regions: the height of the larger of the two regions must be less than twice the height of the smaller region.

Given a candidate text string, we also perform a final series of tests involving the MBRs. The MBRs of the regions (characters) are first verified to lie along a line within a given tolerance of 2 pixels. Observe that characters lying along a diagonal line can therefore easily be identified as a string in the IBM system. The inter-region distance in the string is verified to be less than 16 pixels. We also ensure that the MBRs of adjacent regions do not overlap by more than 2 pixels. If all three conditions are satisfied, the candidate word region is retained as a text string. The final output is a clean binary image containing only the detected text characters (appearing as black on a white background), which can be used directly as input to an OCR system for recognition.

10.1.5.3.1.5 Interframe Analysis for Text Refinement

Optionally, if consecutive frames of a video are processed together in a batch job, the text regions determined from, say, five consecutive frames can be analyzed together to add missing characters in frames and to delete incorrect regions posing as text. This interframe analysis, used by the IBM system to handle videos, exploits the temporal persistence of videotext; it examines the similarity of text regions in terms of their positions, intensities and shape features, and helps to omit false positive regions.

10.1.5.3.2 Text Detection Based on Edge Characterization

The Philips algorithm [2] for text detection exploits text properties, namely the height, width and area of the connected components (CCs) of the edges detected in frames. Further, horizontal alignment is used to merge multiple CCs into a single line of text. The purpose is to output a thresholded image of the detected text lines, with the text as foreground in black on a white background. This can be input to an OCR system to recognize the text characters.

Text extraction is performed on individual video frames. The steps involved are given below. The origin (0,0) of the frame is the top-left corner; any pixel is referenced by its (x, y) location, where x is the position in columns and y in rows.

10.1.5.3.2.1 Channel Separation

We use the red frame of the RGB color space to make it easy to differentiate the colors white, yellow and black, which dominate videotext. By using the red frame, we obtain sharp high-contrast edges for these frequent text colors. However, other color spaces such as HSB or YUV could be used.

10.1.5.3.2.2 Image Enhancement

The frame's edges are enhanced using a 3x3 mask. Noise is further removed using a median filter.

10.1.5.3.2.3 Edge Detection

On the enhanced image, edge detection is performed using the following 3x3 filter:

-1 -1 -1
-1 12 -1
-1 -1 -1


Excluding the image borders, edges are found where the filter output is smaller than EdgeThreshold. Currently the threshold is fixed, although a variable threshold could be used. The fixed threshold results in a lot of salt-and-pepper noise; also, the edges around the text may be broken and not connected. Hence, further processing is needed.
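A non-normative Python sketch of this step follows (EdgeThreshold is the document's fixed threshold; the implementation and names are ours):

import numpy as np
from scipy.ndimage import convolve

KERNEL = np.array([[-1, -1, -1],
                   [-1, 12, -1],
                   [-1, -1, -1]])

def detect_edges(red_channel, edge_threshold):
    """red_channel: 2-D array of the enhanced red frame."""
    response = convolve(red_channel.astype(float), KERNEL, mode="nearest")
    edges = response < edge_threshold               # smaller than EdgeThreshold
    edges[0, :] = edges[-1, :] = edges[:, 0] = edges[:, -1] = False  # borders
    return edges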

10.1.5.3.2.4 Edge Filtering

A preliminary edge filtering is performed to remove areas that probably do not contain text or, even if they do, cannot be reliably detected. Edge filtering can be performed at different levels: one at the frame level and the other at a sub-frame level. At the frame level, if more than a reasonable portion of the frame contains edge pixels, probably due to the number of scene objects, the frame is disregarded and the next one is taken. This can lead to the loss of text in some clean areas and result in false negatives. To overcome this problem, edge filtering is also performed at the sub-frame level. To find text in an "overcrowded" frame, six counters keep counts for subdivisions of the frame: three counters for three vertical portions of the frame (each one third of the frame area), and similarly three counters for three horizontal stripes. Text lines found in high-density edge areas (stripes) are rejected in a subsequent step. This filtering could be done using smaller areas, to retain areas that are clean and contain text in a region smaller than one third of the image.

10.1.5.3.2.5 Character Detection

Next, a Connected Component (CC) analysis is performed on the leftover edges. Text characters are assumed to give rise to connected components or parts thereof. All edge pixels that are located within a certain distance of each other (an eight-pixel neighborhood is used) are merged into CCs. Each CC is tested against size, height, width and area criteria before being passed to the next stage.

10.1.5.3.2.6 Text Box Detection

The connected components that pass the criteria in the previous step are sorted in ascending order based on the location of their bottom-left pixel. The sorting is done in raster-scan order. This list is traversed and the CCs are merged together to form boxes of text. The first connected component, CC1, is assigned to the first box. Each subsequent CCi is tested to see if its bottom-most pixel lies within a preset acceptable "row" threshold of the bottom-most pixel of the current text box. If CCi lies within a few rows (in our case 2 rows) of the current box, there is a good chance that they belong to the same line of text. The row difference threshold is currently fixed, but could also be variable; it could be made a fraction of the height of the current text box. In order to avoid merging CCs that are too far apart in the image, a second test checks whether the column distance between CCi and the text box is less than a column threshold. This threshold is variable and is a multiple of the width of CCi. CCi is merged into the current text box if both tests succeed. If CCi does not merge into the current text box, a new text box is started with CCi as its first component and the traversal continues.

The above process can result in multiple text boxes for a single line of text in the image. Therefore, for each of the text boxes formed by the character merging, a second level of merging is performed. This merges text boxes that might have been mistakenly taken as separate lines of text, either due to strict CC merging criteria or due to a poor edge detection process resulting in multiple CCs for the same character.

Each box is compared to the text boxes following it against a set of conditions. If two boxes are merged, the second box is deleted from the list of text boxes and merged into the first. The test conditions for two text boxes are: (a) the bottom of one box is within the row difference threshold of the other, and the distance between the two boxes in the horizontal direction is less than a variable threshold depending on the average width of the characters in the first box; (b) the center of either box lies within the area of the other text box; or (c) the text boxes overlap. If any of these conditions is satisfied, the two text boxes are merged; this continues until all text boxes have been tested against each other.

10.1.5.3.2.7 Text Line Detection and Enhancement

The leftover boxes are accepted as text lines if they conform to the constraints on area, width and height. For each of the boxes, the corresponding original sub-image is thresholded to obtain the text as foreground in black and everything else as white. This is required so that the binary image can be input to an OCR system. The average grayscale value of the pixels in the box is calculated, as is the average grayscale value, AvgBG, of a region (5 pixels in our case) around the box. Within the box, anything above the average is marked as white and anything below it as black. The grayscale average of the pixels marked as white, Avg1, is calculated along with the average of the black pixels, Avg2. Once the box has been converted to a black and white (binary) image, the averages of the "white region" (Avg1) and of the "black region" (Avg2) are compared to AvgBG (as shown in Figure 16). The region whose average is closer to AvgBG is assigned to be the background and the other region is assigned to be the foreground. In other words, if the "black region" has its average closer to AvgBG, it is converted to white and vice versa. This ensures that the text is always in black.



Figure 16: Separation of text foreground from background.
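A non-normative Python sketch of this polarity decision follows (the implementation and names are ours):

import numpy as np

def binarize_text_box(box, surround):
    """box: 2-D grayscale sub-image of the text box; surround: gray levels of
    the ~5-pixel region around the box (used to compute AvgBG)."""
    avg_bg = float(np.mean(surround))
    white = box > box.mean()                  # provisional "white region"
    avg1 = float(box[white].mean())           # mean of the white region
    avg2 = float(box[~white].mean())          # mean of the black region
    # The region whose mean is closer to AvgBG becomes the background.
    background_is_black = abs(avg2 - avg_bg) < abs(avg1 - avg_bg)
    text_mask = white if background_is_black else ~white
    # Text pixels are set to 0 (black), everything else to 255 (white).
    return np.where(text_mask, 0, 255).astype(np.uint8)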

10.1.5.4 Description Example

<VideoText id="VideoText1" SpatialConnectivity="true"FontSize="40" FontStyle="Courier New" TextType="Superimposed">

<MediaTime><MediaTimePoint> <h>0</h> <m>0</m> <s>0</s>T0:0:0 </MediaTimePoint><MediaDuration> PT6M<m>6</m> </MediaDuration>

</MediaTime><ParametricObjectMotion> </ParametricObjectMotion><Text xml:lang="en-us">Victory</Text>

</VideoText>

10.1.5.5 Description Use

The applications described here highlight video browsing scenarios, in which interesting events in the video (i.e., the presence of videotext as an indicator of information pertaining to persons, locations, product advertisements, sports scores, etc.) are detected automatically and the video content is browsed based on these events, as well as video classification scenarios.

The application with an event-based video browsing capability shows that frames containing videotext can be determined automatically from a digital video stream. The stream can be marked into sections that contain videotext and sections that do not, by automatically determining the contiguous groups of time intervals of frames with and without text. Consequently, the video can be browsed in a nonlinear, random-access fashion based on the occurrence of a specific event; the event in this case is the presence or absence of videotext. A graphical summary of the video shows where videotext annotation is present along the video timeline.

An application demonstrating video classification shows the use of videotext in conjunction with other video features (annotations) for classification. For example, videotext annotation in news programs, along with the detection of talking heads, results in automatically labeled anchor shots.

10.1.6 AudioSegment DS

An AudioSegment DS is a specific Segment DS devoted to audio information. It inherits all the properties of the Segment DS (attributes, decomposition, descriptors and DSs). The AudioSegment DS describes a temporal segment: a group of samples from an audio program. The audio segment may be composed of a single interval or not; this is defined by the TimeMask DS.

10.1.6.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.6.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.1.6.3 Description Extraction

To be defined according to the relevant DSs and Ds included in the audio segment.

10.1.6.4 Description Example

To be defined according to the relevant DSs and Ds included in the audio segment.

10.1.6.5 Description Use

To be defined according to the relevant DSs and Ds included in the audio segment.


10.2 Segment Features

Segment features are descriptors and description schemes representing features of segments. Segment features are the MediaTimeMask for temporal segments (i.e. the VideoSegment DS, MovingRegion DS, and AudioSegment DS); the Mosaic DS for VideoSegment DSs; and the MatchingHint DS and the PointOfView DS for the Segment DS. The visual features of the VideoSegment DS, StillRegion DS, and MovingRegion DS are defined in the Video XM and WD documents. The audio features of the AudioSegment DS are defined in the Audio WD document.

10.2.1 MediaTimeMask DS
The MediaTimeMask DS defines a collection of non-overlapping connected subintervals in time.

10.2.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.2.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.2.1.3 Description Extraction
(See VideoSegment DS).

10.2.1.4 Description Example
(See VideoSegment DS).

10.2.1.5 Description Use
(See VideoSegment DS).

10.2.2 Mosaic DS

A mosaic of a video shot is constructed by aligning and warping the frames of the shot upon each other in a single reference system, giving a panoramic view of the whole shot in one single image. Because of the general redundancy of data in subsequent frames, this results in a very effective summarization, supporting video-to-image transcoding and Universal Multimedia Access (UMA) applications. Given the warping parameters for the frames and the resulting mosaic, the process can also be inverted, which makes it possible to recover any desired number of the initial frames.

10.2.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.2.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.2.2.3 Description Extraction
Because of the general kind of video that MPEG-7 targets for description, it is very important to be able to construct good mosaics even from video shots that are not pre-segmented, videos with many freely moving objects, and videos where motion is large and subsequent frames are therefore largely disaligned. A very stable and robust algorithm is thus important for handling these situations. The algorithm is described in detail in [20]; a brief description is given below.

The input to a mosaicing algorithm is a set of video frames, and the final output is the resulting mosaic image of these frames. Generally, the mosaicing procedure can be divided into two main parts. First the set of frames has to be aligned with each other, and then follows the construction of the mosaic, where the data is merged into one single image. The merging can be done with several criteria, for instance averaging of the frame data or pasting with the latest frame. But before the merging can be done, alignment of the frames has to be performed, and the method of alignment is critical for the quality and robustness of the algorithm.


The aim when developing a mosaic construction method has been to make an algorithm that is robust and reliable, well prepared to deal with unsegmented video containing free-moving objects and large frame disalignment. In order to achieve this, the algorithm is based on minimization of an error measure for the alignment of the frames.

This alignment process between frames is described in detail in the Visual XM in the context of the Parametric Motion Descriptor. When all frames have been aligned, the mosaic can be constructed by merging the data from all frames into the same reference system. Generally, throughout the algorithm, if masks are present, only data that is not masked out is used in the calculations.
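As a non-normative illustration of the merging step, the following Python sketch averages frame data in the mosaic reference system, assuming a 6-parameter affine warp per frame; the parameter ordering and helper names are assumptions, not the normative layout.

import numpy as np

def warp_point(params, x, y):
    # Assumed affine model: (x, y) -> (a0 + a1*x + a2*y, a3 + a4*x + a5*y)
    a0, a1, a2, a3, a4, a5 = params
    return a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y

def merge_frames(frames, warps, mosaic_shape):
    """frames: list of 2-D grayscale arrays; warps: matching affine params.
    Merging by averaging, one of the criteria mentioned above."""
    acc = np.zeros(mosaic_shape)
    cnt = np.zeros(mosaic_shape)
    for frame, params in zip(frames, warps):
        h, w = frame.shape
        for y in range(h):
            for x in range(w):
                u, v = warp_point(params, x, y)
                ui, vi = int(round(u)), int(round(v))
                if 0 <= vi < mosaic_shape[0] and 0 <= ui < mosaic_shape[1]:
                    acc[vi, ui] += frame[y, x]   # accumulate contributions
                    cnt[vi, ui] += 1
    return np.where(cnt > 0, acc / np.maximum(cnt, 1), 0)  # average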

10.2.2.4 Description Examples
A mosaic allows the representation and visualization of a whole shot by a single image. This functionality is also provided by key frames in a different fashion. A mosaic contains more visual information than a single selected frame, but is not suitable for all kinds of shots; thus mosaics and key frames are complementary and can coexist in the same system.

Figure 17 shows a mosaic for the MPEG-7 test sequence 'Clinton' generated from 183 frames, providing a visual summary of the whole shot. This image itself can be described using the StillRegion DS. The instantiation of the MosaicWarpingParameters DS for this example looks as follows (note that here only the frame number is given instead of the full instantiation of the MediaTime DS):

<MosaicWarpingParameters ModelType="6"        // affine model
    SourceSequenceWidth="352"                 // CIF
    SourceSequenceHeight="288"
    Xoffset="2" Yoffset="7"
    Xorigin="176" Yorigin="144"               // center of source
    NoOfMotionParameterSets="183">            // one set per frame
  <MediaTime> ... </MediaTime>                // frame number
  <MotionParameters> 0.0 </MotionParameters>
  <MotionParameters> 0.0 </MotionParameters>
  <MotionParameters> 0.0 </MotionParameters>
  <MotionParameters> 0.0 </MotionParameters>
  <MotionParameters> 0.0 </MotionParameters>
  <MotionParameters> 0.0 </MotionParameters>
  <MediaTime> ... </MediaTime>                // frame number
  <MotionParameters> -3.607092 </MotionParameters>
  <MotionParameters> 0.736747 </MotionParameters>
  <MotionParameters> 0.003689 </MotionParameters>
  <MotionParameters> 0.001071 </MotionParameters>
  <MotionParameters> -0.000175 </MotionParameters>
  <MotionParameters> 0.005172 </MotionParameters>
  :
</MosaicWarpingParameters>

Figure 17: Mosaic for MPEG-7 test sequence 'Clinton' over 183 frames, affine model.

Below are examples of mosaics from the 'Parliament' sequence, first using the normal frame rate and then a very low frame rate in the construction.


Figure 18: Mosaic of "Parliament" sequence constructed from 120 frames, equal to 30 frames/sec.

Figure 19: Mosaic of "Parliament" sequence constructed from 12 frames, equal to 3 frames/sec.

10.2.2.5 Description Use
Besides its use for visual summarization, mosaics offer important means for performing search and comparison of video sequences. Image features can be extracted from the mosaic, yielding a stable description of a whole video sequence. Also, the warping information by itself can be used as a good summarizing description of a video sequence, because it describes the global motion in the video and thus is suitable for searching and comparison of videos. Another functionality is video editing: after constructing the mosaic from a video sequence, it can be manipulated using common image processing tools. The changes are then included in all frames of the reconstructed sequence.

10.2.3 MatchingHint DS
The MatchingHint DS describes the relative importance of low-level descriptors and/or components of low-level descriptors.

10.2.3.1 Description Scheme Syntax

<!-- ############################################### -->
<!-- Definition of "Matching Hint DS"                -->
<!-- ############################################### -->

<!-- Datatype definition of decimal within 0.0 and 1.0 -->
<simpleType name="ZeroToOneDecimalDataType" base="decimal">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</simpleType>

<!-- MatchingHintValue definition -->
<complexType name="MatchingHintValue" base="mds:ZeroToOneDecimalDataType"
             derivedBy="extension">
  <attribute name="idref" type="IDREF" use="optional"/>
  <attribute name="DescriptorName" type="string" use="optional"/>
</complexType>

<!-- MatchingHint DS definition -->
<complexType name="MatchingHint">
  <element name="MatchingHintValue" type="mds:MatchingHintValue"
           minOccurs="1" maxOccurs="unbounded"/>
  <attribute name="id" type="ID" use="optional"/>
  <attribute name="Reliability" type="mds:ZeroToOneDecimalDataType"
             use="default" value="1.0"/>
</complexType>

Editor's Note: The MatchingHint DS requires descriptors to include an id attribute of type ID. This is not implemented in most of the descriptors, such as the video descriptors. The option of using XPointer should be investigated.

10.2.3.2 Description Scheme Semantics

Semantics of the MatchingHintValue DS

MatchingHintValue: Decimal within 0.0 and 1.0 which indicates the matching hint value; 1.0 indicates the most important while 0.0 indicates the least important.
ZeroToOneDecimalDataType: Datatype defining a decimal within 0.0 and 1.0.
idref: Identifier of the descriptor (or component of descriptor) being weighted.
DescriptorName: Name of the descriptor (or component of descriptor) being weighted. This component can be used when the descriptor being weighted has no id.

Semantics of the MatchingHint DS

MatchingHint: A DS for describing hints for matching, e.g., the importance of descriptors or components of each descriptor.
id: Identifier of a matching hint description.
Reliability: Decimal within 0.0 and 1.0 which indicates how reliable the matching hints are. For example, if the reliability is low (close to 0.0), the matching hint value is not trustworthy; on the other hand, if the reliability is high (close to 1.0), the matching hint value represents the importance of the target description very well.
MatchingHintValue: Matching hint value within 0.0 and 1.0, and reference or name of the descriptor (or component of descriptor) being weighted.

10.2.3.3 Description Extraction
There is no limitation on the extraction methods for the MatchingHint DS, but an automatic extraction method based on relevance feedback can work well for extracting the values of the DS. In this section, we briefly describe the automatic extraction method for MatchingHints.

The automatic extraction of MatchingHints is based on a learning algorithm using cluster information describing which data are in the same cluster. This information can also be given by relevance feedback. A descriptor that classifies the data in the same cluster better gets a higher hint value, and a descriptor that misclassifies data from different clusters into the same cluster gets a lower hint value. The hint value can be learned continuously as more feedback arrives. To prevent wrong learning from wrong feedback, we use a reliability value describing how reliable the hint values are; in other words, the reliability describes how stable the current hint values are. The reliability becomes high when there is frequent feedback from experts. The higher the reliability, the less a single feedback affects the current hint value.

The equation for learning the MatchingHint value is as follows:
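One formulation consistent with the behavior described in the next paragraph is the weighted average below; this is an illustrative reconstruction, not necessarily the original formula:

Wnew = (R * Wt + ec * Wc) / (R + ec)

where Wt are the currently stored matching hints, Wc the matching hints derived from feedback, R the current reliability, and ec the effect value.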


The system calculates matching hints from the feedback, and updates and stores the matching hints of an image (or a part of a video). The updated matching hints (Wnew) are based on the matching hints (Wt) currently stored for the image (or the part of a video) and the matching hints (Wc) derived from feedback, as shown in the above equation. In the equation, the updated matching hint (Wnew) is affected by Wt and Wc in proportion to the current reliability and the effect value: the updated matching hint (Wnew) is more affected by Wt when the reliability is higher, and more affected by Wc when the effect value is higher. The effect value ec is high when the user giving feedback is an authorized expert, but generally it is fixed as a constant.

The matching hints Wc are calculated from the feedback as follows. If the feedback marks a relevant image, then the matching hint of a descriptor becomes higher when the similarity calculated using this descriptor is high. On the contrary, in the case of an irrelevant image, the matching hint of a descriptor becomes higher when the distance calculated using this descriptor is high.

Reliability is updated by the following equation. The reliability becomes high when there is frequent feedback and a continuous increase of performance by learning:

Rnew = f(Rold + e * (IncreaseTerm + a))
IncreaseTerm = Precision(t) - Precision(t-1)

Rnew: new updated reliability
Rold: current reliability
e: effect coefficient
a: constant value
Precision(t): the precision calculated from feedback at time t
Precision(t-1): the precision calculated from feedback at time t-1
f(): normalizing function
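The following non-normative Python sketch combines the update rules above, using the reconstructed weighted-average form of Wnew; the normalization of similarities to [0, 1], the clipping form of f(), and all helper names are assumptions.

def hints_from_feedback(similarities, relevant):
    """Wc: per-descriptor hints derived from one feedback item."""
    raw = {d: (s if relevant else 1.0 - s) for d, s in similarities.items()}
    total = sum(raw.values()) or 1.0
    return {d: v / total for d, v in raw.items()}        # normalize to sum 1

def update_hints(Wt, Wc, R, ec=0.1):
    """Wnew = (R*Wt + ec*Wc) / (R + ec), per descriptor."""
    return {d: (R * Wt[d] + ec * Wc[d]) / (R + ec) for d in Wt}

def update_reliability(Rold, precision_t, precision_t1, e=0.1, a=0.05):
    increase = precision_t - precision_t1
    return min(1.0, max(0.0, Rold + e * (increase + a)))  # f(): clip to [0,1]

# One feedback step for an image with three descriptors.
Wt = {"ColorHistogram": 0.4, "DominantColor": 0.3, "TextureHistogram": 0.3}
Wc = hints_from_feedback({"ColorHistogram": 0.9, "DominantColor": 0.5,
                          "TextureHistogram": 0.4}, relevant=True)
Wnew = update_hints(Wt, Wc, R=0.8)
Rnew = update_reliability(0.8, precision_t=0.75, precision_t1=0.70)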

10.2.3.4 Description Example
The following example shows how the MatchingHint DS can be used to attach importance (or credibility) values to the descriptors of a region in a still image. In this example, the hints could be used to determine the importance of the descriptors in image matching.

<StillRegion id="IMG0001">
  <MatchingHint id="Weight01" Reliability="0.900000">
    <MatchingHintValue DescriptorName="ColorHistogram">
      0.455
    </MatchingHintValue>
    <MatchingHintValue DescriptorName="DominantColor">
      0.265
    </MatchingHintValue>
    <MatchingHintValue DescriptorName="TextureHistogram">
      0.280
    </MatchingHintValue>
  </MatchingHint>
  <ColorHistogram>
    <!-- Details omitted -->
  </ColorHistogram>
  <DominantColor>
    <!-- Details omitted -->
  </DominantColor>
  <TextureHistogram>
    <!-- Details omitted -->
  </TextureHistogram>
  <!-- Other details omitted -->
</StillRegion>

10.2.3.5 Description Use
The MatchingHint DS can be used to represent the importance of low-level descriptors for matching data. For example, we can retrieve the data of interest using a combination of descriptors with different matching hints. Because which descriptor is important in similarity-based retrieval differs depending on the data, the hint about descriptor importance can improve retrieval performance.
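For illustration, a retrieval system might combine per-descriptor distances as in the following non-normative sketch; the descriptor names and distance functions are placeholders.

def weighted_distance(query, target, hints, distance_fns):
    """Combine per-descriptor distances, weighted by matching hints."""
    total_w = sum(hints.values()) or 1.0
    return sum(hints[d] * distance_fns[d](query[d], target[d])
               for d in hints) / total_w

hints = {"ColorHistogram": 0.455, "DominantColor": 0.265,
         "TextureHistogram": 0.280}
distance_fns = {d: (lambda a, b: abs(a - b)) for d in hints}  # toy distances
query = {d: 0.2 for d in hints}
target = {d: 0.5 for d in hints}
print(weighted_distance(query, target, hints, distance_fns))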

10.2.4 PointOfView DS
The PointOfView DS describes the relative importance of a segment given a specific viewpoint. The viewpoint is the criterion for the relative importance.

10.2.4.1 Description Scheme Syntax

<!-- ############################################## -->
<!-- Definition of "PointOfView DS"                 -->
<!-- ############################################## -->

<!-- Primitive Importance type definition -->
<complexType name="PrimitiveImportanceType"
             base="mds:ZeroToOneDecimalDataType" derivedBy="extension">
  <attribute name="idref" type="IDREF" use="optional"/>
</complexType>

<!-- Definition of PointOfView type -->
<complexType name="PointOfView">
  <!-- To accommodate supplementary information -->
  <element name="Info" type="mds:StructuredAnnotation"
           minOccurs="0" maxOccurs="1"/>
  <element name="Value" type="mds:PrimitiveImportanceType"
           minOccurs="1" maxOccurs="unbounded"/>
  <attribute name="id" type="ID" use="optional"/>
  <!-- To accommodate a specific semantic viewpoint -->
  <attribute name="ViewPoint" type="string" use="required"/>
</complexType>

Editor's Note: Why is "idref" needed? Does PointOfView not specify the importance of the segment including it?
Editor's Note: What is the reason for having more than one value for one segment and one viewpoint?

10.2.4.2 Description Scheme Semantics


Semantics of PrimitiveImportanceType

PrimitiveImportanceType: Type of the primitive importance element with a decimal value within 0.0 and 1.0.
idref: Identifier of the segment being weighted.

Semantics of PointOfView DS

PointOfView: Functionality-specific segment filtering to describe the relative importance of the segment from a given specific semantic viewpoint as importance values.
id: Identifier of a point of view description.
ViewPoint: String that specifies the semantic viewpoint for weighting the segment. For example, if the importance value is assigned to a video segment of a football game based on how important the segment is for Team A in the game, the ViewPoint can be "TeamA".
Info: Structured annotation that specifies supplementary information regarding the importance.
Value: Primitive importance element.

10.2.4.3 Description Extraction
The importance value is provided by using authoring tools. For example, a content creator provides importance values manually using authoring tools.

10.2.4.4 Description Examples
The following example describes the video segments of a football game to which relative importance values are attached from two viewpoints, "Team A" and "Team B".

<VideoSegment id="FootBallGame">
  <MediaTime>...</MediaTime>
  <SegmentDecomposition DecompositionType="Temporal">
    <VideoSegment id="seg1">
      <MediaTime>...</MediaTime>
      <PointOfView ViewPoint="Team B">
        <Value>0.7</Value>
      </PointOfView>
    </VideoSegment>
    <VideoSegment id="seg2">
      <MediaTime>...</MediaTime>
      <PointOfView ViewPoint="Team A">
        <Value>0.5</Value>
      </PointOfView>
    </VideoSegment>
    :
    <VideoSegment id="seg20">
      <MediaTime>...</MediaTime>
      <PointOfView ViewPoint="Team A">
        <Value>0.8</Value>
      </PointOfView>
      <PointOfView ViewPoint="Team B">
        <Value>0.2</Value>
      </PointOfView>
    </VideoSegment>
    :
  </SegmentDecomposition>
</VideoSegment>

10.2.4.5 Description Use
The PointOfView DS can be used for summarizing multimedia content dynamically from the combination of the viewpoint(s) and the duration that a user, agent, or system requires. The segment filtering functionality enabled by the PointOfView DS can be used as a preprocessor for the Summary DS to make summary descriptions. The PointOfView DS can also be used for scalable multimedia distribution by generating different multimedia presentations that consist of the filtered segments, which can be adapted to the different capabilities of client terminals, network conditions, user preferences, and so on.
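As a non-normative illustration, the following Python sketch filters segments for a viewpoint under a duration budget; the segment record layout is an assumption.

def filter_segments(segments, viewpoint, max_duration):
    """segments: list of dicts with 'id', 'start', 'duration' (seconds) and
    'importance': {viewpoint: value}. Returns ids in timeline order."""
    ranked = sorted(segments,
                    key=lambda s: s["importance"].get(viewpoint, 0.0),
                    reverse=True)
    chosen, total = [], 0.0
    for seg in ranked:
        if total + seg["duration"] <= max_duration:
            chosen.append(seg)
            total += seg["duration"]
    return [s["id"] for s in sorted(chosen, key=lambda s: s["start"])]

segments = [
    {"id": "seg1", "start": 0, "duration": 30, "importance": {"Team B": 0.7}},
    {"id": "seg2", "start": 30, "duration": 40, "importance": {"Team A": 0.5}},
    {"id": "seg20", "start": 70, "duration": 20,
     "importance": {"Team A": 0.8, "Team B": 0.2}},
]
print(filter_segments(segments, "Team A", max_duration=60))  # ['seg2', 'seg20']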

10.3 SegmentRelationship graph

10.3.1 SegmentRelationshipGraph DS
The SegmentRelationshipGraph DS can represent a graph of segments and relationships among the segments. Although hierarchical structures such as the trees provided by the segment DSs are adequate for efficient access and retrieval, some relationships cannot be expressed using such structures. The SegmentRelationshipGraph DS adds flexibility in describing more general relationships among segments.

10.3.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.3.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

10.3.1.3 Description Extraction
The extraction of relationships among segments can be done automatically or by hand, based on different criteria (e.g. semantic and temporal information). This section describes the spatio-temporal structure (relationship) features extracted for segmented video objects' regions in the AMOS system and later used for retrieval (see sections 10.1.4.3 and 10.1.4.5) [25]. In this section, the spatio-temporal structure of an object should be understood as the spatio-temporal relationships among its composing regions.

In AMOS, given a set of regions of a semantic video object, the structure features are derived from their spatial and temporal positions and boundaries. These features are pre-computed and properly stored to allow quick access and examination in the similarity matching process. AMOS uses two spatial structure features and one temporal structure feature (Figure 20).

Figure 20: Examples of spatio-temporal relationship graphs

There are several different ways to compare spatial structure, such as 2D-Strings [7] and spatio-temporal logics [5]. AMOS uses a relatively fast and simple method, the spatial-orientation graph, to represent spatial relationships as edges in a weighted graph [8]. This graph can be easily constructed by computing the orientation of each edge between the centroids of a pair of query regions, and stored as an n(n-1)/2-dimensional feature vector, where n is the number of query regions (Figure 20a). The spatial-orientation graph cannot handle the "contain" relationship, so AMOS complements it with a topological graph (Figure 20b), which defines the contain relationship between each pair of regions with three possible values: contains (1), is contained (-1), or not containing each other (0).

In a similar way, the temporal graph (Figure 20c) defines whether one region starts before (1), after (-1), or at the same time as (0) another region. Note that here, for simplicity, we only define the temporal order according to the first appearing time (or starting time) of each region. By also taking the ending time into account, a more complicated temporal relationship graph can be generated.
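The three structure features can be sketched as follows (non-normative Python; bounding-box containment and the record layout are simplifying assumptions).

import math

def orientation_vector(centroids):
    """Spatial-orientation graph: edge angles between all centroid pairs,
    stored as an n(n-1)/2-dimensional vector."""
    n = len(centroids)
    return [math.atan2(centroids[j][1] - centroids[i][1],
                       centroids[j][0] - centroids[i][0])
            for i in range(n) for j in range(i + 1, n)]

def contains(box_a, box_b):
    """True if box_a = (x0, y0, x1, y1) contains box_b (an approximation)."""
    return (box_a[0] <= box_b[0] and box_a[1] <= box_b[1] and
            box_a[2] >= box_b[2] and box_a[3] >= box_b[3])

def topological_value(box_a, box_b):
    # contains (1), is contained (-1), not containing each other (0)
    return 1 if contains(box_a, box_b) else (-1 if contains(box_b, box_a) else 0)

def temporal_value(start_a, start_b):
    # starts before (1), after (-1), at the same time (0)
    return 1 if start_a < start_b else (-1 if start_a > start_b else 0)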

10.3.1.4 Description Examples

10.3.1.4.1 Documentary Video

Consider the nature documentary in Figure 21. Some temporal relationships among the narrator moving region (id="narrator-movreg"), the rabbit moving region (id="rabbit-movreg"), and the sun moving region (id="sun-movreg") are described using the SegmentRelationshipGraph DS below. These segments are assumed to have been described using the Segment DSs with the specified id attributes. The example below shows how to list relationships among segments by SegmentRelationshipNode DS (the "Starts" relationship) and by SegmentNode DS (the "Before" relationship).

Figure 21: Nature documentary.

<SegmentRelationshipGraph type="temporal">
  <SegmentRelationshipNode>
    <SegmentRelationship type="temporal" degree="2" name="Starts"/>
    <SegmentNode>
      <ReferenceToSegment idref="sun-movreg"/>
    </SegmentNode>
    <SegmentNode>
      <ReferenceToSegment idref="rabbit-movreg"/>
    </SegmentNode>
  </SegmentRelationshipNode>

  <SegmentNode>
    <ReferenceToSegment idref="narrator-movreg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="temporal" degree="2" name="Before"/>
      <SegmentNode>
        <ReferenceToSegment idref="sun-movreg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
</SegmentRelationshipGraph>

10.3.1.4.2 Video shot in a soccer game
Figure 22 shows the key frame of a video shot capturing a goal in a soccer game. The description of this video shot at the structural level with the current MDS XM/WD could be as follows. The entire video shot could be described by a VideoSegment DS with id "Goal-sg". The kick and not-catch events in the video shot could be represented by two video segments with ids "Kick-sg" and "Not-Catch-sg", respectively. The spatio-temporal regions corresponding to the goal, the ball, the forward, and the goalkeeper could be described by MovingRegion DSs with ids "Goal-rg", "Ball-rg", "Forward-rg", and "Goalkeeper-rg", respectively.


Figure 22: Key frame of a video shot capturing a goal in a soccer game.

Figure 23: Syntactic relationships among "Goal-sg" segment, "Kick-sg" segment, "Not-Catch-sg" segment, and "Goal-rg" region for Figure 22.



Figure 24: Syntactic relationships among "Kick-sg" segment, "Not-Catch-sg" segment, "Goal-rg" region, "Forward-rg" region, "Ball-rg" region, and "Goalkeeper-rg" region for Figure 22.

Once these video segments and moving regions are defined by the VideoSegment and MovingRegion DSs, the SegmentRelationshipGraph DS could be used to specify relationships among them. Some examples of structural relationships among these elements are shown in Figure 23 and Figure 24 following the most common diagrammatic notations for entity-relationship models.

In Figure 23, the "Goal-rg" region, the "Kick-sg" video segment, and the "Not-Catch-sg" video segment are described as being temporally included in the "Goal-sg" segment. At the moment, the relationships among moving regions and video segments can only be specified using the SegmentRelationshipGraph DS. The "Kick-sg" video segment is also described as happening before the "Not-Catch-sg" video segment with a temporal, metric relationship, "Before".

In Figure 24, the "Kick-sg" video segment is described as being temporally composed of the "Forward-rg" and the "Ball-rg" moving regions. Similarly, the "Not-Catch-sg" video segment is described as being temporally composed of the "Ball-rg" and the "Goalkeeper-rg" moving regions. The "Goalkeeper-rg" region is related to the "Forward-rg" moving region by a spatial, directional relationship, "Upper right of". The "Ball-rg" moving region is related to the "Goal-rg" moving region by a visual, global relationship, "Moves towards".

The examples of structural relationships shown in Figure 23 and Figure 24 are written in XML below.

<SegmentRelationshipGraph>
  <!-- Relationships in Figure 23 -->
  <SegmentNode>
    <ReferenceToSegment idref="Goal-sg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Temporal Topological" name="Composed of"/>
      <SegmentNode>
        <ReferenceToSegment idref="Kick-sg"/>
        <ReferenceToSegment idref="Not-Catch-sg"/>
        <ReferenceToSegment idref="Goal-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
  <SegmentNode>
    <ReferenceToSegment idref="Kick-sg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Temporal Metric" name="Before"/>
      <SegmentNode>
        <ReferenceToSegment idref="Not-Catch-sg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>

  <!-- Relationships in Figure 24 -->
  <SegmentNode>
    <ReferenceToSegment idref="Kick-sg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Temporal Topological" name="Composed of"/>
      <SegmentNode>
        <ReferenceToSegment idref="Forward-rg"/>
        <ReferenceToSegment idref="Ball-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
  <SegmentNode>
    <ReferenceToSegment idref="Not-Catch-sg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Temporal Topological" name="Composed of"/>
      <SegmentNode>
        <ReferenceToSegment idref="Goalkeeper-rg"/>
        <ReferenceToSegment idref="Ball-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
  <SegmentNode>
    <ReferenceToSegment idref="Goalkeeper-rg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Spatial Directional" name="Upper right of"/>
      <SegmentNode>
        <ReferenceToSegment idref="Forward-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
  <SegmentNode>
    <ReferenceToSegment idref="Ball-rg"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Visual Global" name="Moves towards"/>
      <SegmentNode>
        <ReferenceToSegment idref="Goal-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentNode>
</SegmentRelationshipGraph>

10.3.1.4.3 SMIL
Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [21] is a W3C standard that enables simple authoring of TV-like multimedia presentations that combine audio, video, images, and text. The SMIL standard defines an XML-based language that allows control over the what, where, and when of media elements in a multimedia presentation, with a markup language similar to HTML.

The two basic SMIL elements for controlling the sequence and timing of the media elements in the presentation are <par> and <seq>. Media elements within <par> tags will play in parallel. Media elements referenced within <seq> tags will play in sequence. It is possible to nest tags in any combination to create complicated effects. The attributes begin, end, and dur of the media elements set explicit timing properties to control the start and duration of each media element. The presentation described below begins with a text shown for 20 seconds, followed by an image with an audio narration in parallel, and then a slide show image in parallel with a sequence of an audio narration (which begins after a 2-second delay) followed by a video.

<seq>
  <text src="text1" dur="20s"/>
  <par>
    <audio src="audio1"/>
    <img src="image1"/>
  </par>
  <par>
    <img src="image2"/>
    <seq>
      <audio src="audio2" begin="2s"/>
      <video src="video1"/>
    </seq>
  </par>
</seq>

These timing relationships are applicable to segments and temporal regions in the Generic DS, and SMIL's parallel and sequential relationships are candidates for SegmentRelationships; they can include attributes such as begin, end, and MediaDuration. We could include these in the SegmentRelationshipGraph DS by refining its DSs through inheritance. See below for possible refinements of the SegmentRelationship and the SegmentNode of the SegmentRelationshipGraph DS.

<complexType name="SMILRelationship" base="SegmentRelationship"
             derivedBy="extension">
  <attribute name="name">
    <simpleType base="string">
      <enumeration value="sequential"/>
      <enumeration value="parallel"/>
    </simpleType>
  </attribute>
  <attribute name="degree" use="fixed" value="2"/>
</complexType>

<complexType name="SMILSegmentNode" base="SegmentNode"
             derivedBy="extension">
  <attribute name="begin_time" type="time"/>
  <attribute name="end_time" type="time"/>
  <attribute name="MediaDuration" type="time"/>
</complexType>

Consider the nature documentary in Figure 25. Some temporal relationships among the narrator segment (narrator-sg), the rabbit still region (rabbit-rg), the sun still region (sun-rg), and the eagle still region (eagle-rg) in this video are expressed in XML below.

<SegmentRelationshipGraph name="SMIL">
  <SegmentRelationshipNode>
    <SegmentRelationship xsi:type="SMILRelationship" name="sequential"/>
    <SegmentNode xsi:type="SMILSegmentNode">
      <ReferenceToSegment idref="narrator-sg"/>
    </SegmentNode>
    <SegmentRelationshipNode>
      <SegmentRelationship xsi:type="SMILRelationship" name="parallel"/>
      <SegmentNode xsi:type="SMILSegmentNode">
        <ReferenceToSegment idref="sun-rg"/>
      </SegmentNode>
      <SegmentNode xsi:type="SMILSegmentNode" end_time="0:00:20">
        <ReferenceToSegment idref="rabbit-rg"/>
      </SegmentNode>
      <SegmentNode xsi:type="SMILSegmentNode" begin_time="0:00:05"
                   MediaDuration="0:00:15">
        <ReferenceToSegment idref="eagle-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentRelationshipNode>
</SegmentRelationshipGraph>


Figure 25: Nature documentary.

10.3.1.4.4 2D-String
The 2D-String scheme [7] is a data structure for spatial reasoning and spatial similarity computing. We use it to represent the spatial relationships among the regions corresponding to the objects depicted in Figure 26. In Figure 26, the following objects can be identified: rabbit, cloud, airplane, and car. The 2D-String representation of the image in Figure 26 is (rabbit < cloud < car = airplane, rabbit = car < airplane < cloud), where the symbol "=" denotes the spatial relationships "at the same x as" and "at the same y as", the symbol "<" denotes the spatial relationships "left of/right of" and "below/above", and the symbol ":" denotes the spatial relationship "in the same set as".

See below for the XML representation of the associated 2D-String for the example in Figure 26. In this case, we nest relationships inside other relationships to represent the x and y dimensions of the 2D-String efficiently. We also list the relationships by relationship rather than by entity node, for efficiency. When a relationship includes more than two entity nodes, the relationship applies to all the pairs of entity nodes in order (e.g. the rabbit is to the right of the cloud, which is to the right of the car and the airplane).

Figure 26: An image with associated 2D-String (rabbit < cloud < car = airplane, rabbit = car < airplane < cloud).

<SegmentRelationshipGraph name="2D-String">
  <SegmentRelationshipNode>
    <SegmentRelationship type="Spatial Metric" name="Right of" degree="2"/>
    <SegmentNode>
      <ReferenceToSegment idref="rabbit-rg"/>
    </SegmentNode>
    <SegmentNode>
      <ReferenceToSegment idref="cloud-rg"/>
    </SegmentNode>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Spatial Metric" name="At the same x as"/>
      <SegmentNode>
        <ReferenceToSegment idref="car-rg"/>
      </SegmentNode>
      <SegmentNode>
        <ReferenceToSegment idref="airplane-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
  </SegmentRelationshipNode>
  <SegmentRelationshipNode>
    <SegmentRelationship type="Spatial Metric" name="Top of" degree="2"/>
    <SegmentRelationshipNode>
      <SegmentRelationship type="Spatial Metric" name="At the same y as"/>
      <SegmentNode>
        <ReferenceToSegment idref="rabbit-rg"/>
      </SegmentNode>
      <SegmentNode>
        <ReferenceToSegment idref="car-rg"/>
      </SegmentNode>
    </SegmentRelationshipNode>
    <SegmentNode>
      <ReferenceToSegment idref="airplane-rg"/>
    </SegmentNode>
    <SegmentNode>
      <ReferenceToSegment idref="cloud-rg"/>
    </SegmentNode>
  </SegmentRelationshipNode>
</SegmentRelationshipGraph>
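For illustration, the 2D-String itself can be derived from region centroids as in the following non-normative sketch; the coordinates and the tolerance standing in for "=" are assumptions.

def one_d_string(regions, axis, tol=1.0):
    """Build one rank string ('a < b = c ...') along axis 0 (x) or 1 (y)."""
    items = sorted(regions.items(), key=lambda kv: kv[1][axis])
    parts = [items[0][0]]
    for (pname, pxy), (name, xy) in zip(items, items[1:]):
        sep = " = " if abs(xy[axis] - pxy[axis]) <= tol else " < "
        parts.append(sep + name)
    return "".join(parts)

regions = {"rabbit": (10, 40), "cloud": (30, 90), "car": (60, 40),
           "airplane": (60, 70)}
x_string = one_d_string(regions, axis=0)   # rabbit < cloud < car = airplane
y_string = one_d_string(regions, axis=1)   # rabbit = car < airplane < cloud
print("(%s, %s)" % (x_string, y_string))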

10.3.1.5 Description Use
Descriptions of the SegmentRelationshipGraph DS can be used in retrieval and visualization applications, among others. See section 10.1.2.3 for a query model for searching video objects that includes spatio-temporal relationships among the objects' regions as described in section 10.3.2.

10.3.2 Types of relationships
Some segment relationships are normative.

10.3.2.1 Normative segment relationships
See the corresponding section of the Working Draft.

10.3.2.2 Non-normative segment relationships
In this section, we describe a possible classification framework for segment relationships and provide more examples of segment relationships that are not yet normative.

The segment relationships can be divided into three classes: spatial, temporal, and visual. One could argue that the spatial and temporal relationships are just special cases of visual relationships; however, we define spatial and temporal relationships in a special way: we consider the elements as boundaries in space or time with no information about size or duration, respectively. See the table at the end of this section for a summary of the types of structural relationships and examples.

The spatial relationships can be divided into the following classes: (1) topological, i.e. how the boundaries of elements relate; (2) metric, i.e. distance in spatial space; and (3) orientational or directional, i.e. where the elements are placed relative to each other in 2D space. Examples of topological relationships are "To contain", "To overlap", and "To be adjacent to"; examples of metric relationships are "To be near" and "To be far"; examples of directional relationships are "To be in front of", "To be to the left of", and "To be on top of". Well-known spatial relationship graphs are the 2D-String, the Theta-R Graph, and Attributed-Relationship Graphs.

In a similar fashion, the temporal relationships can be classified into topological, metric, and directional classes. Examples of temporal topological relationships are "To happen in parallel", "To overlap", and "To happen within"; examples of metric temporal relationships are "To happen near to" and "To happen far from"; examples of directional temporal relationships are "To happen before" and "To happen after". SMIL's parallel and sequential relationships are examples of temporal topological relationships.

Visual relationships relate elements based on their visual attributes or features. These relationships can be divided into global, local, and composition classes. For example, a visual global relationship could be "To be smoother than" (based on a global texture feature), a visual local relationship could be "To accelerate faster" (based on a local motion feature), and a visual composition relationship could be "To be more symmetric than" (based on a 2D geometry feature). Visual relationships can be used to group structural elements based on any combination of visual features: color, texture, 2D geometry, motion, deformation, camera motion, etc.

Structural relationships can be defined at different levels (see section 4.5.1 of the WD document): at a generic level ("Near") or at a specific level ("0.5 feet from"). For example, operational relationships such as "To be the union of", "To be the intersection of", and "To be the negation of" are topological, specific relationships, either spatial or temporal.


Types of relationships and examples:

Structural - Spatial - Topological: Adjacent to, Overlap, Contained in, Composed of, Consist of; The union, The intersection, The negation
Structural - Spatial - Metric: Near from, Far from; R in Theta-R Graph, 0.5 inches from
Structural - Spatial - Directional: Left of, Top of, Upper left of, Lower right of, Behind; 2D-String's spatial relationships; 20 degrees north from, 40 degrees east from, the union of two segments; Theta and R in Theta-R Graph
Structural - Temporal - Topological: Co-begin, Co-end, Parallel, Sequential, Overlap, Adjacent, Within, Composed of, Consist of; SMIL's <seq> and <par>; The union, The intersection, The negation; SMIL's <seq> and <par> with attributes (start time, end time, MediaDuration)
Structural - Temporal - Metric: Near from, Far from; 20 min. apart from, 20 sec. overlap
Structural - Temporal - Directional: Before, After; 20 min. after
Visual - Global: Smoother than, Darker than, More yellow than, Similar texture, Similar color, Similar speed; Distance in texture feature, Distance in color histogram; Indexing hierarchy based on color histogram
Visual - Local: Faster than, To grow slower than, Similar speed, Similar shape; 20 miles/hour faster than, Grow 4 inches/sec. faster than; Indexing hierarchy based on local motion or deformation features
Visual - Composition: More symmetric than; Distance in symmetry feature; Indexing hierarchy based on symmetry feature

Table: Classification framework for segment relationships.


11 Description of the conceptual aspects of the content

Most elements describing the conceptual aspects of content are still under validation in core experiments. The Affective DS describes the audiences' affective response to segments.

11.1.1 Affective DS
The Affective DS describes the audiences' affective response to segments by assigning a score to each segment. The resulting representation of the score along the story timeline captures the audiences' mood/emotion changes during content viewing as well as the story shape, i.e. how the story develops along the story timeline.

11.1.1.1 Description Scheme Syntax

<!-- ############################################## -->
<!-- Definition of "Affective DS"                   -->
<!-- ############################################## -->

<!-- Normalized type definition -->
<simpleType name="NormalizedType" base="string">
  <enumeration value="None"/>
  <enumeration value="ByPeak"/>
  <enumeration value="ByTotal"/>
</simpleType>

<!-- Affect type definition -->
<simpleType name="AffectType" base="string">
  <enumeration value="Interested"/>
  <enumeration value="Excited"/>
  <enumeration value="Bored"/>
  <enumeration value="Surprised"/>
  <enumeration value="Sad"/>
  <enumeration value="Hateful"/>
  <enumeration value="Angry"/>
  <enumeration value="Expectant"/>
  <enumeration value="Happy"/>
  <enumeration value="Scared"/>
  <enumeration value="StoryComplication"/>
  <enumeration value="StoryShape"/>
</simpleType>

Editor's Note: AffectType could be a ControlledTerm.

<!-- Declaration of the Affective element (Affective DS) -->
<element name="Affective">
  <complexType>
    <element name="Score" minOccurs="1" maxOccurs="unbounded">
      <complexType base="mds:ZeroToOneDecimalDataType"
                   derivedBy="extension">
        <attribute name="idref" type="IDREF" use="optional"/>
      </complexType>
    </element>
    <element name="Info" type="mds:StructuredAnnotation"
             minOccurs="0" maxOccurs="1"/>
    <attribute name="id" type="ID" use="optional"/>
    <attribute name="Confidence" type="mds:ZeroToOneDecimalDataType"
               use="default" value="1.0"/>
    <!-- To specify the affect type -->
    <attribute name="AffectValue" type="mds:AffectType" use="required"/>
    <attribute name="Normalized" type="mds:NormalizedType"
               use="default" value="None"/>
  </complexType>
</element>

11.1.1.2 Description Scheme Semantics

Semantics of the datatypes

ScoreValueType: Datatype of the score value, defined within 0.0 and 1.0, where 1.0 is the highest score and 0.0 the lowest.
NormalizedType: Type of weight value normalization supported by this scheme.
None: The weight values are not normalized.
ByPeak: The weight values are normalized by their peak (maximum) value.
ByTotal: The weight values are normalized by the sum of all the weight values, i.e., the total of the weight values is equal to 1.
AffectType: Type of affect supported by this scheme.
Interested: The weight value indicates how interested the audiences are.
Excited: The weight value indicates how excited the audiences are.
Bored: The weight value indicates how bored the audiences are.
Surprised: The weight value indicates how surprised the audiences are.
Sad: The weight value indicates how sad the audiences feel.
Hateful: The weight value indicates how hateful the audiences feel.
Angry: The weight value indicates how angry the audiences feel.
Expectant: The weight value indicates how expectant the audiences feel.
Happy: The weight value indicates how happy the audiences feel.
Scared: The weight value indicates how scared the audiences feel.
StoryComplication: The weight value indicates the intensity of the story complication.
StoryShape: The weight value indicates how a story is developing along the story timeline. Its peak depicts the climax of the story.

Semantics of Affective DS

Affective: Functionality-specific segment weighting to describe the relative intensity of the audiences' affective response at each segment.
id: Identifier of an affective description.
Confidence: Decimal within 0.0 and 1.0 that specifies the confidence of the score values.
AffectValue: Specifies the kind of affect for this weighting.
Normalized: Specifies the type of weight value normalization. Non-normalized is the default value.
NormalizedType: Type of weight value normalization supported by this scheme.
Score: Decimal within 0.0 and 1.0 that represents the intensity of the specified affect; 1.0 indicates the highest score while 0.0 indicates the lowest score. It also includes an attribute with the identifier of the segment being weighted.
Info: Structured annotation that specifies supplementary information regarding the weighting.

11.1.1.3 Description Extraction
While the weighting information can be obtained through content evaluation based on subjective analysis, it can also be extracted based on physiological measurement and analysis of the audiences.


11.1.1.4 Description Examples
The following example describes the "story shape" of the video segments. Note that the Confidence value of 0.85 indicates that the weight values in the example were obtained through a statistical analysis that provides the reliability of the weight values.

<!-- Description of video structure -->
<VideoSegment id="program">
  <MediaTime>...</MediaTime>
  <SegmentDecomposition DecompositionType="Temporal">
    <VideoSegment id="scene1">
      <MediaTime>...</MediaTime>
    </VideoSegment>
    <VideoSegment id="scene2">
      <MediaTime>...</MediaTime>
    </VideoSegment>
    :
  </SegmentDecomposition>
</VideoSegment>

<!-- Story shape description (normalized by peak) -->
<Affective id="affect1" Confidence="0.85" AffectValue="StoryShape"
           Normalized="ByPeak">
  <Info>
    <TextAnnotation>
      by N Univ. students based on the Semantic Score Method
    </TextAnnotation>
  </Info>
  <Score idref="scene1">0.00360117</Score>
  <Score idref="scene2">0.00720234</Score>
  :
</Affective>

11.1.1.5 Description Use
The description of the Affective DS provides high-level information on the audiences' interpretation as well as perception of the target content, and can be used in various ways. One example is preprocessing for video summarization: in the case of story shape, for instance, one can obtain a video summary that reflects the story development. Furthermore, a highlight video summary can be obtained by selectively concatenating high-score video segments when "AffectValue" takes a value of, e.g., "Excited". The description can also be used as fundamental data in high-level content analysis. For example, since the time-dependent score pattern depends strongly on the genre of the content, it may be used to classify content by genre.
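As a non-normative illustration, a highlight summary can be assembled from Affective scores as follows; the record layout and the threshold are assumptions.

def highlight(segments, affect, threshold=0.5):
    """segments: list of (segment_id, {affect: score}); returns ids whose
    score for the requested affect exceeds the threshold, in order."""
    return [sid for sid, scores in segments
            if scores.get(affect, 0.0) > threshold]

segments = [("scene1", {"Excited": 0.2}),
            ("scene2", {"Excited": 0.9}),
            ("scene3", {"Excited": 0.7})]
print(highlight(segments, "Excited"))  # ['scene2', 'scene3']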

Editor’s Note: Explain the use of the attribute "Normalized".


12 Content navigation and access

12.1 Summarization

12.1.1 Summarization DS
The Summarization DS is used to specify a set of summaries to enable rapid browsing, navigation, visualization and sonification of AV content. Each summary is an audio-visual abstract of the content.

12.1.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.1.3 Description Extraction
N.A.

12.1.1.4 Description Example

<Summarization>
  <HierarchicalSummary>...</HierarchicalSummary>
  <SequentialSummary>...</SequentialSummary>
</Summarization>

12.1.1.5 Description Use
The Summarization DS enables fast and effective browsing and navigation of AV content by providing immediate access to audio-visual abstracts. The Summarization DS does not prescribe specific ways an application uses or presents these abstracts.

12.1.2 Summary DS
The Summary DS is used to specify an audio-visual abstract of AV content. The role of the Summary DS is to convey information about the AV content that is essential for rapid browsing and navigation.

12.1.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.2.3 Description Extraction
N.A.

12.1.2.4 Description Example
N.A.

12.1.2.5 Description Use
N.A.

12.1.3 HierarchicalSummary DS
The HierarchicalSummary DS is used to specify a group of audio-visual summaries, possibly ordered hierarchically. All audio-visual summaries in this group have the same type, but each represents an alternative view of the AV content.


The HierarchicalSummary DS may organize summaries into a succession of levels, each describing the audio-visual content at a specific level of detail. In general, levels closer to the root of the hierarchy provide coarse summaries and levels further away from the root provide more detailed summaries. Elements in the hierarchy are specified by the HighlightLevel DS. Each element in a hierarchy has one parent. Each element in a hierarchy may have one or more children.

12.1.3.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.3.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.3.3 Description Extraction
See section 12.1.5.

12.1.3.4 Description Example
The following is an example of a HierarchicalSummary that contains a summary with key video clips. This summary may, for example, contain interesting video clips, ordered in a multi-level fashion. Since the hierarchyType of this HierarchicalSummary is "dependent", a highlight summary at level n+1 adds more video clips to the highlight summary at level n. Thus, each level accumulates more information to provide a longer and more extensive video summary. The RefLocator element refers to the original (source) video that is being summarized.

<Summarization>
  <HierarchicalSummary name="keyVideoSummary001"
                       summaryTypeList="keyVideoClips"
                       hierarchyType="dependent">
    <RefLocator>
      <MediaURL>file://disk/video001.mpg</MediaURL>
    </RefLocator>
    <ReferenceToSegment idref="segment001"/>
    <HighlightLevel>...</HighlightLevel>
  </HierarchicalSummary>
</Summarization>

12.1.3.5 Description Use
See section 12.1.5.

12.1.4 HighlightLevel DS
The HighlightLevel DS is used to specify a summary at a particular level of detail by referring to a sequence of audio-visual segments (such as video clips or audio clips) or images (such as key-frames). A HighlightLevel at a particular level in the hierarchy may correspond to, for example, a highlight with a particular time duration or a particular set of events.

12.1.4.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.4.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.4.3 Description Extraction
See section 12.1.5.

12.1.4.4 Description Example
The following is an example of a set of two highlights referring to particular events in a program, in particular "slam dunks" and "three-point shots" in a basketball game video. The first summary contains two video clips, each showing a slam dunk; the second summary contains three video clips, each showing a three-point shot. By grouping the clips into summaries of events, a user may choose to view only the clips of slam dunks; alternatively, the user may view all three-point shots. Note that in this case, there is no notion of hierarchy in the underlying real-world events.

<Summarization>
  <HierarchicalSummary name="keyEventsSummary001"
                       summaryTypeList="keyVideoClips keyEvents">
    <ReferenceToProgram idref="mediainstance1"/>
    <SummaryThemeList>
      <SummaryTheme xml:lang="en" id="E0"> slam dunk </SummaryTheme>
      <SummaryTheme xml:lang="en" id="E1"> 3-point shots </SummaryTheme>
    </SummaryThemeList>
    <HighlightLevel name="summary001" themeIds="E0">
      <HighlightSegment>...</HighlightSegment>
      <HighlightSegment>...</HighlightSegment>
    </HighlightLevel>
    <HighlightLevel name="summary002" themeIds="E1">
      <HighlightSegment>...</HighlightSegment>
      <HighlightSegment>...</HighlightSegment>
      <HighlightSegment>...</HighlightSegment>
    </HighlightLevel>
  </HierarchicalSummary>
</Summarization>

The following is an example of a hierarchical summary consisting of a set of video clips organized into two levels. At the highest level, the summary has a duration of 40 seconds and consists of only two video clips. At the second level, the summary has a duration of 100 seconds and consists of five video clips. Note that the coarse and the fine summary share two video clips, but these common clips are specified only once, by utilizing a dependent hierarchical structure (hierarchyType is dependent).

<Summarization>
  <HierarchicalSummary name="keyVideoSummary002"
                       summaryTypeList="keyVideoClips"
                       hierarchyType="dependent">
    <ReferenceToProgram idref="program001"/>
    <HighlightLevel name="highlight001" level="0" duration="40">
      <HighlightSegment> ... </HighlightSegment>
      <HighlightSegment> ... </HighlightSegment>
      <HighlightLevel name="highlight002" level="1" duration="100">
        <HighlightSegment> ... </HighlightSegment>
        <HighlightSegment> ... </HighlightSegment>
        <HighlightSegment> ... </HighlightSegment>
      </HighlightLevel>
    </HighlightLevel>
  </HierarchicalSummary>
</Summarization>

12.1.4.5 Description Use
The HighlightLevel DS can be used to construct video highlights, audio highlights, or key-frame summaries. The fidelity values of HighlightLevel elements in a HierarchicalSummary may be used by an application to adaptively select HighlightLevel elements for customization purposes. For instance, a variable number of key-frames can be extracted from the HierarchicalSummary in a scalable manner, given fidelity values for all elements with key-frames in the hierarchy. See section 12.1.5.

12.1.5 HighlightSegment DS
The HighlightSegment DS is used to specify an audio-visual segment of AV content. A HighlightSegment DS may be used to refer to a video clip, an audio clip, key-frames, and key-sounds.

12.1.5.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.5.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.5.3 Description Extraction

KeyVideoClips summary
Video clips constituting a particular video highlight can be selected manually according to the relative importance of their contents, given the time duration constraint.

KeyAudioClips summary
Audio clips constituting a particular audio highlight can be selected manually according to the relative importance of their contents, given the time duration constraint.

KeyEvents summary
Video clips associated with a particular event can be selected manually.

KeyFrames summary
In the following, three different algorithms are described for automatic hierarchical key-frame extraction.

Algorithm A
A simple approach to automatic generation of a hierarchical key-frame summary of a video program is as follows (a sketch follows the list):
1. Uniformly subsample the video frames temporally by a factor of N to determine the finest-level key-frame summary;
2. Subsample the result of (1) temporally by a factor of M to determine the next coarser-level key-frame summary;
3. Repeat step (2) recursively.
This process generates a hierarchical key-frame summary in a fine-to-coarse manner.
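A non-normative Python sketch of Algorithm A (N, M, and the number of levels are example parameters):

def hierarchical_keyframes(num_frames, n, m, levels):
    """Returns a list of key-frame index lists, finest level first."""
    current = list(range(0, num_frames, n))   # finest level: every N-th frame
    summary = [current]
    for _ in range(levels - 1):
        current = current[::m]                # next coarser level: every M-th
        summary.append(current)
    return summary

# 300 frames, N=10, M=3, three levels: 30, 10 and 4 key-frames.
for level in hierarchical_keyframes(300, n=10, m=3, levels=3):
    print(len(level), level[:5])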

Algorithm B
Algorithm B uses measures of scene action to detect shot boundaries and to determine the number and position of key-frames allocated to each shot, as follows. Each video frame i within shot s is represented by its color histogram vector h_i(q), where q is the color index. Define a measure of action between two frames i and j by the difference of their histograms h_i and h_j, i.e., a(i, j) = sum over q of |h_i(q) - h_j(q)|.

1. Shot boundaries are determined by thresholding the action measure as follows. Assume that the first n frames of the sequence do not correspond to shot boundaries. Compute the mean action measure m over these first n frames. Set the threshold for shot boundary detection to T = m + c*s, where s is the standard deviation of the action measure over the same frames and c is a predetermined parameter. Once a boundary is detected, a new threshold is determined in the same manner using the first n frames of the next shot.

2. Define the cumulative action measure for the first k frames in a shot as A(k) = SUM_{i=2..k} a(i-1, i). Given a total number of key-frames N to be allocated to the entire video, allocate to each shot a number of key-frames, n, proportional to the total cumulative action in that shot.

3. For each shot, approximate the area under the cumulative action curve with rectangles of variable width, where the density of the rectangles increases with the slope of this curve. Along the temporal axis, the rectangles partition a shot into contiguous shot segments separated by breakpoints t(j). The frame situated at the midpoint of a shot segment is defined as the key-frame representing that segment. The following iterative algorithm determines the breakpoints t(j) and the time instances k(j) of the associated key-frames.

Let the time instance of the first break-point be t(0) = 0;
Let the location of the first key-frame be k(0) = 0;
FOR j = 1 through n DO {
  Compute the next break-point t(j) as the first frame for which A(t(j)) >= (j/n) * A(L), where L is the last frame of the shot;
  IF t(j) exceeds the shot boundary THEN STOP;
  ELSE let k(j) be the first frame past t(j-1) for which A(k(j)) >= ( A(t(j-1)) + A(t(j)) ) / 2;
}


4. Optimize the time instances of the key-frames, starting from the second key-frame positioned at k(1), as follows. Determine k(j) (j >= 1) by choosing the frame between t(j-1) and t(j) that produces the largest value of the action measure with respect to the previous key-frame k(j-1). This criterion can be referred to as the "largest successive difference" criterion.
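The following Python sketch is a non-normative illustration of steps 1 and 2 above: adaptive-threshold shot boundary detection on the action measure, and allocation of key-frames proportional to the cumulative action per shot. The histogram array and the parameters n and c are illustrative assumptions.

import numpy as np

def action(h1, h2):
    """Action measure a(i,j): sum of absolute histogram differences."""
    return float(np.abs(h1 - h2).sum())

def detect_shots(hists, n, c):
    """Step 1: adaptive-threshold shot boundary detection.

    hists: (num_frames, num_bins) array of per-frame color histograms
    n:     number of leading frames assumed free of shot boundaries
    c:     threshold parameter (boundary if action > mean + c * std)
    """
    a = np.array([action(hists[i - 1], hists[i]) for i in range(1, len(hists))])
    boundaries, start = [], 0
    while start < len(a):
        window = a[start:start + n]
        thresh = window.mean() + c * window.std()
        # first action value beyond the learning window that exceeds the threshold
        idx = next((i for i in range(start + n, len(a)) if a[i] > thresh), None)
        if idx is None:
            break
        boundaries.append(idx + 1)   # the next shot starts at frame idx + 1
        start = idx + 1              # re-estimate the threshold in the next shot
    return boundaries, a

def allocate_keyframes(a, boundaries, total_kf):
    """Step 2: allocate key-frames proportionally to cumulative action."""
    edges = [0] + boundaries + [len(a) + 1]
    cum = [a[edges[s]:edges[s + 1] - 1].sum() for s in range(len(edges) - 1)]
    total = sum(cum) or 1.0
    return [max(1, int(round(total_kf * x / total))) for x in cum]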

A hierarchical key-frame summary is generated by clustering the key-frames extracted by the algorithm described above. Clustering is performed using the histogram vectors of the key-frames. The clustering algorithm considers pairs of clusters of key-frames, where key-frames belonging to the same cluster are consecutive in time. Let n be the number of key-frames in the most detailed summary of a particular shot. The next coarser level has n/c key-frames, where c is an arbitrary but predetermined compaction ratio. Start with an equally spaced partitioning of the key-frame histograms, where each of the resulting partitions contains c histogram vectors of consecutive key-frames (see Figure 27). Then, starting with the first partition, adjust each partition boundary between two adjacent partitions so as to minimize the l2 norm for the two adjacent partitions on either side of the partition boundary, as follows.

1. Assign the centroid histogram vector as the representative vector for each partition;
2. If r(j) is the representative vector for the vectors in partition P(j) and r(j+1) is the representative vector of the adjacent partition P(j+1), adjust the boundary between them such that the total sum of squared distances of the vectors in each cluster to the corresponding representative vector is minimized (see Figure 27);
3. If partition P(j) becomes empty, then delete r(j) from the set of representative vectors. If partition P(j+1) becomes empty, then delete r(j+1) from the set of representative vectors;
4. Continue with the next pair of partitions, until all partition boundaries are processed.

Apply steps 1-4 for 10 iterations (or until the decrease in total distortion is insignificant). After stopping, the frame in the first cluster whose histogram vector is closest to the cluster's representative vector is selected as the first key-frame. Key-frames for subsequent clusters are determined according to the "largest successive difference" criterion expressed in terms of the action measure. Coarser-level summaries can be obtained by recursive application of the above procedure.
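A non-normative sketch of the pairwise boundary adjustment follows; the histogram array, the compaction ratio c, and the exhaustive search over candidate boundary positions are illustrative choices (the search keeps both partitions non-empty, so the deletion case of step 3 does not arise here).

import numpy as np

def adjust_partitions(hists, c, iters=10):
    """Pairwise boundary adjustment for one coarser summary level.

    hists: (n, num_bins) array of key-frame histograms at the finest level
    c:     compaction ratio; the coarser level has n // c clusters
    """
    n = len(hists)
    k = max(1, n // c)
    bounds = list(np.linspace(0, n, k + 1, dtype=int))  # equally spaced partitions
    for _ in range(iters):
        for j in range(1, k):                 # adjust each interior boundary T_j
            lo, hi = bounds[j - 1], bounds[j + 1]
            best, best_cost = bounds[j], np.inf
            for t in range(lo + 1, hi):       # candidates keep partitions non-empty
                left, right = hists[lo:t], hists[t:hi]
                cost = (((left - left.mean(0)) ** 2).sum()
                        + ((right - right.mean(0)) ** 2).sum())
                if cost < best_cost:
                    best, best_cost = t, cost
            bounds[j] = best
    reps = [hists[bounds[j]:bounds[j + 1]].mean(0) for j in range(k)]  # centroids
    return bounds, reps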


Figure 27: Pairwise clustering for hierarchical key-frame summarization in Algorithm B. In this example, the compaction ratio is 3. First T1 is adjusted in (a), considering only the two consecutive partitions at either side of T1. Then T2 and T3 are adjusted as depicted in (b) and (c), respectively.

Algorithm C
Key-frame extraction in Algorithm C consists of the following steps, illustrated in Figure 28 (see also [3] and [16]).

1. Feature points are extracted from each frame and then Kalman filtering is used to track the feature points in the next frame. Feature points correspond to points which contain a significant amount of texture, such as corner points. Such points are good candidates for tracking. The algorithm used for the feature point extraction step has been developed by Kanade et al. The KLT software library, which is publicly available at http://vision.stanford.edu/~birch/klt/index.html, can be used to implement the feature point extraction step. Other feature point extraction algorithms can also be used. Since many feature points have to be tracked, a data association filter is required. The nearest neighbor filter is used within this algorithm. In order to validate the association, a texture descriptor, characterizing the neighborhood of the feature point, is used. A track at time k is defined as a sequence of feature points up to time k which have been associated with the same target. See Figure 29 for an illustration.

2. Shot boundaries are detected using an activity measure based on the rate of change in tracks. This activity measure depends on the number of terminated or initiated tracks, and is defined as the maximum of the percentages of terminated and initiated tracks. The percentage of initiated tracks is the number of new tracks divided by the total number of tracks in the current frame, while the percentage of terminated tracks is the number of removed tracks divided by the total number of tracks in the previous frame. For shot boundary detection, a video sequence is modeled as a set of successive stationary and nonstationary states of the activity measure. Significant events correspond to the stationary states, which are characterized by a constant or slowly time-varying activity change. On the other hand, shot boundaries correspond to an abrupt change (cut) or fast change (dissolve). Accordingly, the temporal segmentation algorithm should fulfill the following requirements: a) detection of abrupt changes or fast changes; b) detection of stationary segments. For this purpose, a recursive temporal segmentation algorithm is used, which models the data as a succession of states, each represented as a Gaussian process. A change in the state corresponds to a change in the parameters of the process (its mean m and variance s^2). The process parameters are updated recursively, for example as

m(k) = b * m(k-1) + (1 - b) * a(k)
s^2(k) = b * s^2(k-1) + (1 - b) * (a(k) - m(k))^2

where a(k) is the current activity change and b is a coefficient acting as an attenuation factor. If the current activity change falls outside the confidence interval of the current process (for example, if |a(k) - m(k-1)| exceeds a multiple of the current standard deviation), then a new Gaussian process is initialized, with mean equal to the current activity change and standard deviation set to a large value. A non-normative sketch of this segmentation is given below, after step 3. Figure 30 shows the activity change and its representation as a succession of Gaussian processes. Short impulses correspond to short processes with high activity change (shot boundaries), while longer segments correspond to Gaussian processes representing significant events.

3. A representative key-frame is extracted from the stationary or weakly non-stationary states (flat or oblique parts of the activity measure). The frame in the middle of the flat or oblique part is selected as a key-frame. Such a choice allows representing the state in a compact way. For instance, in the case of a state that is part of a zooming camera operation, the frame in the middle of this state is a good compromise between the wide and focused views. A significant state corresponds to a state with a long duration or a high activity change value. The introduction of a significance value, computed as the product of the activity change value (the activity change mean) and the duration of the corresponding state, allows us to have a scalable summary. The significance value, which is assigned to each key-frame, can be interpreted as the area of the rectangle built by the duration of the state and the activity change mean. Once the key-frames have been ordered according to their significance values, the number of key-frames can be adapted according to the user's preferences or the client device capabilities.
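The following Python sketch illustrates the recursive temporal segmentation of step 2, assuming an exponentially weighted update of the mean and variance and a simple deviation test for opening a new state; the parameters beta, thresh and init_std are illustrative, and the exact update used in a reference implementation may differ.

def segment_activity(activity, beta=0.95, thresh=3.0, init_std=10.0):
    """Model the activity-change signal as a succession of Gaussian states.

    A new state is opened when the current sample deviates from the running
    mean by more than thresh standard deviations. Returns a list of states
    as (start_index, end_index, mean_activity) tuples.
    """
    states = []
    start, mean, var = 0, activity[0], init_std ** 2
    for k in range(1, len(activity)):
        a = activity[k]
        if abs(a - mean) > thresh * var ** 0.5:
            states.append((start, k, mean))         # close the current state
            start, mean, var = k, a, init_std ** 2  # open a new Gaussian process
        else:                                       # attenuated (recursive) update
            mean = beta * mean + (1 - beta) * a
            var = beta * var + (1 - beta) * (a - mean) ** 2
    states.append((start, len(activity), mean))
    return states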

A hierarchy of key-frames can be extracted as follows.
1. Let N0 be the number of key-frames extracted; this set of key-frames forms the first level in the hierarchy. The above algorithm assigns a significance value to each key-frame extracted.
2. Let N1 be the desired number of key-frames (N1 < N0) at the next level of the hierarchy. These N1 key-frames can be determined out of the N0 frames via rank ordering of the significance values.
3. By repeating step 2 recursively, the algorithm facilitates a multi-level key-frame summary with a decreasing number of key-frames, without having to apply the key-frame extraction algorithm multiple times.
For example, for a 3-level summary, one has N0 > N1 > N2, where N0, N1, and N2 denote the number of key-frames in the first, second and third levels. Note that these levels are not dependent, i.e., there is no parent-child relationship between the key-frames at two successive levels.
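A non-normative sketch of this rank-ordering step, assuming key-frames are identified by frame numbers:

def keyframe_hierarchy(frame_numbers, significance, level_sizes):
    """Multi-level summary by rank ordering of significance values.

    frame_numbers: key-frames delivered by the extraction step
    significance:  matching list of significance values
    level_sizes:   decreasing sizes [N0, N1, N2, ...] of the levels
    """
    ranked = sorted(zip(frame_numbers, significance), key=lambda p: -p[1])
    levels = []
    for n in level_sizes:
        # keep the n most significant key-frames, restored to temporal order
        levels.append(sorted(kf for kf, _ in ranked[:n]))
    return levels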


Figure 28: Shot boundary detection and key-frame selection in Algorithm C.

Figure 29: Example tracking result (frame numbers 620, 621, 625). Note that many feature points disappear during the dissolve, while new feature points appear.


Figure 30: Activity change (top). Segmented signal (bottom).

KeyFrames summary with fidelity values

The following algorithm describes the computation of fidelity values associated with the key-frames in a hierarchical key-frame summary. Assume that a feature descriptor, such as a color histogram, is available for each key-frame. Define a distance metric on the color histograms, such as the l1 norm. Consider a hierarchical summary with key-frames as shown in Figure 31. The key-frame in each node is identified by A, B, etc., while fidelity values are denoted by e1, e2, etc. Consider, for example, the fidelity value e1, which indicates how well the key-frames in the subtree with root at node B are represented by the key-frame at node A. The fidelity value e1 can be obtained as follows.
1. Compute the maximum distance between the histogram of A and those of its children B, E, F and G;
2. Take the reciprocal of this maximum distance.
After all fidelity values in the hierarchy are computed, normalize them so that they lie between 0.0 and 1.0.
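The following Python sketch (non-normative) computes fidelity values in this way for a hierarchy given as a parent-to-children mapping; the tree and the distance function are illustrative assumptions.

def compute_fidelity(children, hist, dist):
    """Fidelity of each node's subtree with respect to its parent key-frame.

    children: dict mapping a node to its child nodes, e.g. {'A': ['B','C','D'], ...}
    hist:     dict mapping a node to its histogram vector
    dist:     distance metric on histograms (e.g. the l1 norm)
    """
    def subtree(node):                    # node plus all of its descendants
        nodes = [node]
        for ch in children.get(node, []):
            nodes += subtree(ch)
        return nodes

    fid = {}
    for parent in children:
        for child in children[parent]:
            # step 1: maximum distance from the parent's histogram to the subtree
            dmax = max(dist(hist[parent], hist[n]) for n in subtree(child))
            # step 2: reciprocal of the maximum distance
            fid[child] = 1.0 / dmax if dmax > 0 else float('inf')
    # normalize the fidelity values so that they lie between 0.0 and 1.0
    finite = [v for v in fid.values() if v != float('inf')] or [1.0]
    top = max(finite)
    return {n: 1.0 if v == float('inf') else v / top for n, v in fid.items()}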

Figure 31: An example of a key-frame hierarchy.

12.1.5.4 Description Example
The following is a description of the key-frame hierarchy in Figure 31. Note that child elements are specified before other information about an element in the hierarchy, i.e., the description is in post-order. The fidelity values refer to Figure 31 and should be understood as numerical values.

<Summarization>
  <HierarchicalSummary name="mySummary003" summaryTypeList="keyFrames" hierarchyType="independent">
    <ReferenceToSegment idref="segment001"/>
    <HighlightLevel name="coarse keyframe summary" level="0">
      <HighlightLevel name="medium keyframe summary" level="1" fidelity="e1">
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e4">
          <HighlightSegment name="key-frame E">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e5">
          <HighlightSegment name="key-frame F">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e6">
          <HighlightSegment name="key-frame G">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightSegment name="key-frame B">
          <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
        </HighlightSegment>
      </HighlightLevel>
      <HighlightLevel name="medium keyframe summary" level="1" fidelity="e2">
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e7">
          <HighlightSegment name="key-frame H">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e8">
          <HighlightSegment name="key-frame I">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e9">
          <HighlightSegment name="key-frame J">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightSegment name="key-frame C">
          <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
        </HighlightSegment>
      </HighlightLevel>
      <HighlightLevel name="medium keyframe summary" level="1" fidelity="e3">
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e10">
          <HighlightSegment name="key-frame K">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e11">
          <HighlightSegment name="key-frame L">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightLevel name="fine keyframe summary" level="2" fidelity="e12">
          <HighlightSegment name="key-frame M">
            <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
          </HighlightSegment>
        </HighlightLevel>
        <HighlightSegment name="key-frame D">
          <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
        </HighlightSegment>
      </HighlightLevel>
      <HighlightSegment name="key-frame A">
        <ImageLocator><MediaTime>...</MediaTime></ImageLocator>
      </HighlightSegment>
    </HighlightLevel>
  </HierarchicalSummary>
</Summarization>

12.1.5.5 Description Use
The HierarchicalSummary DS structures information in AV content into groups of video clips, possibly using a hierarchical set of levels. The resulting description allows navigating the content by "zooming in" and "zooming out" to various levels of detail.

KeyFrames summary
The notion of fidelity allows a variety of practically useful functionalities, such as fidelity-based summaries, scalable hierarchical summaries and quick search. For example, if a user specifies a preferred number of key-frames to preview the whole video, the system utilizes the fidelity attributes in the summary to select the given number of key-frames which best represent the original video, in a sense defined by the encoder used to automatically generate the key-frame hierarchy. This set of key-frames is then sent to the terminal device for display.

Consider the key-frame hierarchy in Figure 31. In this example, key-frame B has three fidelity values, e4, e5 and e6, associated with each of its three subtrees. The value e4 indicates the degree of representativeness of key-frame B over its subtree rooted at E. Thus, key-frame B represents its children contained in the subtree rooted at E with fidelity e4. As we go down to finer levels, the fidelity values generally become larger, indicating that a key-frame at a finer level represents its subtree better than one at a coarser level. The following algorithm automatically selects a given number of key-frames from a key-frame hierarchy, based on the fidelity values. Denote the desired number of key-frames to be selected by N, and denote the set of selected key-frames by K.

initialize K to the empty set;
add the root node to K;
WHILE ( card(K) < N ) {
  consider the nodes n whose parent node is in K while n itself is not in K;
  let n* be the node among these with the lowest fidelity value;
  add n* to K;
}

The algorithm above yields the set K that contains the N key frames selected. For example, in Figure 31, assume that N=2. If e3 is the minimum among e1, e2 and e3, the key frames A and D are selected. This decomposes the original tree in Figure 31 into the two subtrees shown in Figure 32.
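A non-normative Python implementation of this selection procedure, using a priority queue over the fidelity values; the dictionary layout is an illustrative assumption.

import heapq

def select_keyframes(children, fidelity, root, n):
    """Select N key-frames from a fidelity-annotated key-frame hierarchy.

    children: dict mapping a node to its child nodes
    fidelity: dict mapping each non-root node to the fidelity with which its
              parent's key-frame represents its subtree
    """
    selected = {root}
    frontier = [(fidelity[ch], ch) for ch in children.get(root, [])]
    heapq.heapify(frontier)
    while len(selected) < n and frontier:
        _, node = heapq.heappop(frontier)   # subtree that is worst represented
        selected.add(node)
        for ch in children.get(node, []):
            heapq.heappush(frontier, (fidelity[ch], ch))
    return selected

# For Figure 31, with N=2 and e3 minimal among e1, e2 and e3, this yields {A, D}.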


Figure 32: An example of the key-frame selection algorithm based on fidelity values.

12.1.6 SequentialSummary DS
The SequentialSummary DS is used to specify a single audio-visual summary, which contains a sequence of images or video frames, possibly synchronized with audio, composing a slide show or audio-visual skim. The images or video frames and the audio clips that are part of the summary may be stored separately from the original AV content, to allow fast playback. Video frames may be stored individually or as part of a composite. Alternatively, the SequentialSummary may refer to the original frames in the AV program.

12.1.6.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.6.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.6.3 Description Extraction
See section 12.1.7.

12.1.6.4 Description Example
See section 12.1.7.

12.1.6.5 Description Use
The sequence of images, or frames of a video program, in a SequentialSummary can be shown sequentially in time, for example as an animated slide show. It also supports fast playback of parts of a video program, by referring to a separately stored composite of video frames. See section 12.1.7.

12.1.7 FrameProperty DS
The FrameProperty DS is used to specify certain properties associated with an image or video frame in a slide show or audio-visual skim. These properties may specify: the location of the image or video frame, possibly by referring to the original AV content; the relative amount of scene activity associated with the video frame; and a region-of-interest in the image or video frame.

12.1.7.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.7.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.1.7.3 Description Extraction
Frame-change-values are easily calculated from MPEG video streams, because P-frames include motion vector information in their macroblocks, from which frame-change-values can be computed as follows.
1. Extract the motion vectors from the inter-coded macroblocks in a P-frame and sum up their absolute values.
2. Normalize the values with respect to the image size, by dividing the sum by the number of macroblocks in the frame.
3. Normalize the values with respect to the time interval, by dividing the result of step 2 by the duration between the P-frame and its reference picture.
The following are two special cases: the first is the case of intra-coded macroblocks, and the second is the case of field-predicted macroblocks. If a macroblock is intra-coded, it is ignored. For field-predicted macroblocks, the absolute values of the motion vectors are multiplied by 1/2 (because there are two motion vectors in the macroblock).
Note that a frame-change-value cannot be calculated from I- and B-frames. In order to obtain a frame-change-value for every frame, linear interpolation is used to fill in the missing values temporally. This interpolation can be performed in the viewer application to decrease the size of the description. Alternatively, it is possible to calculate frame-change-values for all frames and include them in the description; in this case, viewer applications need no interpolation operation.
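As a non-normative illustration, the following Python sketch computes a frame-change-value from motion vectors already parsed from a P-frame; the macroblock record format is an assumption made for the example.

def frame_change_value(macroblocks, dt):
    """Frame-change-value of a P-frame from its macroblock motion vectors.

    macroblocks: parsed records such as {'type': 'inter'|'intra'|'field',
                 'mvs': [(dx, dy), ...]}
    dt:          time interval between the P-frame and its reference picture
    """
    total = 0.0
    for mb in macroblocks:
        if mb['type'] == 'intra':
            continue                                    # intra macroblocks ignored
        weight = 0.5 if mb['type'] == 'field' else 1.0  # two vectors per field MB
        for dx, dy in mb['mvs']:
            total += weight * (abs(dx) + abs(dy))       # step 1: sum absolute values
    total /= max(len(macroblocks), 1)                   # step 2: normalize by size
    return total / dt                                   # step 3: normalize by time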


If the source video is not coded in an MPEG format, block-based template matching, for example, can be used to extract motion vectors. Another method for calculating frame-change-values is to apply pixel-wise subtraction of color values between one frame and the next. With this method, almost the same effect can be obtained for implementing smart quick view.

Note that the video skim or smart quick view may consist of thumbnail-sized images, to save data size. Thumbnails can be extracted from the I-frames in an MPEG bitstream by decoding only the DC DCT-coefficients of the macroblocks. The size of the extracted images is 1/16 of the original. Each frame can be coded individually (as a sequence of bitmaps, JPEG images, etc.) or packed into one composite file (such as a Motion-JPEG file, QuickTime movie file, etc.). When the thumbnails are saved in separate files, each ImageLocator element will contain a URL. On the other hand, when the thumbnails are packed into a single file, the location of the file is described by a URL in the VideoSegmentLocator element and each ImageLocator element contains only frame numbers or time-stamps.
Thumbnail-sized images can also be obtained by selecting and decoding (if necessary) frames at some regular interval in a video stream, and down-sampling them into small-size images. Note that smart quick view may require high-speed playback (without audio), in which case ready-made thumbnail-sized images are useful. However, when frames from the original video stream are used, ready-made thumbnails are not required.

In some applications, it is useful to indicate regions-of-interest in video images, with varying sizes and at varying locations. For instance, the video content may contain frames with text or faces, which should be visualized at a higher resolution to highlight such areas or to improve readability of the video text, while the other video data may be visualized at a lower resolution. In this case, the Region element may be used to specify such regions-of-interest in the original video frames. Also, the images in the video skim can be clipped from the corresponding original frames and stored separately as still images, which can be referred to using the ImageLocator element in the FrameProperty DS. The selection of such regions can be done manually, and should generally be adapted to the content. For instance, regions-of-interest can be image regions with text, such as the score in a sports game, or regions with the faces of persons that the user may be interested in. Note that there may be multiple regions-of-interest corresponding to the same video frame; this can be specified using multiple FrameProperty elements referring to the same frame (using the MediaTime element).

12.1.7.4 Description Example
The following is an example of a SequentialSummary, enabling a smart quick view.

<SequentialSummary name="SoccerSummary"> <!-- Specifies the location of a composite video skim summary --> <VideoSegmentLocator> <MediaURL>soccer.mpg</MediaURL> </VideoSegmentLocator>

<!-- Properties of frame 2 --> <FrameProperty> <!-- Frame number in the original video --> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>2 </RelIncrTime></RefTime> <ImageLocator> <!-- Frame number of display frame in video skim --> <MediaTime><RelIncrTime nnFraction=’30’ nn=’1’>0 </RelIncrTime></MediaTime> </ImageLocator> </FrameProperty>

<!-- Properties of frame 5 --> <FrameProperty> <!-- Frame number in the original video --> <refTime><RelIncrTime nnFraction=’30’ nn=’1’>5 </RelIncrTime></RefTime> <!-- Frame change value --> <FrameActivity>21</FrameActivity> </FrameProperty>

<!-- Properties of frame 8 --> <FrameProperty> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>8 </RelIncrTime></RefTime> <FrameActivity>20</FrameActivity>

Page 146: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

</FrameProperty>

<!-- Properties of frame 11 --> <FrameProperty> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>11 </RelIncrTime></RefTime> <ImageLocator> <!-- Frame number of display frame in video skim --> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>1 </RelIncrTime></RefTime> </ImageLocator> </FrameProperty>

<!-- Properties of frame 14 --> <FrameProperty> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>14 </RelIncrTime></RefTime> <FrameActivity>17</FrameActivity> </FrameProperty>

<!-- Properties of frame 17 --> <FrameProperty> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>17 </RelIncrTime></RefTime> <FrameActivity>11</FrameActivity> </FrameProperty>

<!-- Properties of frame 20 --> <FrameProperty> <RefTime><RelIncrTime nnFraction=’30’ nn=’1’>20 </RelIncrTime></RefTime> <ImageLocator> <!-- Frame number of display frame in video skim --> <MediaTime><RelIncrTime nnFraction=’30’ nn=’1’>2 </RelIncrTime></MediaTime> </ImageLocator> </FrameProperty></SequentialSummary>

Editor's Note: This example should be updated with respect to Time DS elements.

12.1.7.5 Description Use
The SequentialSummary DS specifies a video skim, similar to fast-forward video playback. The SequentialSummary DS also allows video playback with variable speed, also called "smart quick view". In conventional fast-forward mode, video is played back at a constant speed, independent of the amount of activity or motion in the scene, which can make it difficult to understand the video content. In smart quick view mode, the video playback speed is adjusted so as to stabilize the amount of scene change. This requires computation of the amount of scene change for the frames of the video. In this case, one of the frame properties is the "frame-change-value", which is a measure of the change from one frame to the next. Consequently, a viewer can dynamically adjust the playback frame rate; that is, the playback speed is decreased if the frame-change-value is high and increased if it is low.

The following notation is used to describe normal quick view and smart quick view.
Frame rate of the original video: R frames/sec
Frame number in the original video: i (0, 1, 2, ..., K)
Frame-change-value for frame i: f(i)
Playback speed factor with respect to the original video frame rate: m
Display frame rate (playback frame rate): r frames/sec
Display cycle number: j (0, 1, 2, ..., N)

In the case of normal quick view, for each display cycle j (j = 0, 1, 2, ..., N), the frame number i in the original video is given by (m*R/r)*j. The total number of displayed frames N is given by K/(m*R/r). Frame i can be extracted from the original video, or a thumbnail-sized image may be used. In the case where a viewer application can only use the I-frames of the original video, the I-frame nearest to the computed frame i can be used for display.

In the case of smart quick view, the frame-change-values are used to adjust the playback speed automatically. Assume that frame-change-values f(i) are available for each frame i of the original video. If f(i) is not available for every frame, linear interpolation should be applied to compute the missing values. The frame-change-values f(i) are normalized such that their sum equals 1. Denote the normalized values by w(i) (i = 0, ..., K-1). In order to stabilize playback in smart quick view mode and, at the same time, achieve the display frame rate r on average, the frame i to be displayed is controlled by the viewer based on w(i). Figure 33 shows a plot of the normalized frame-change-values w(i). The temporal axis can be partitioned into N segments, such that the sum of the frame-change-values w(i) inside each segment (approximately the area under the curve) equals 1/N. The boundaries between these segments are used as decision points to calculate the frame number i of the original video to be displayed, as follows. For display cycle j (the display frame rate is r), find the first i for which w(0) + w(1) + ... + w(i) >= j/N; that is, accumulate the values of w(i) until their sum exceeds j/N. This determines the frame number i of the original video to be displayed at time j. Again, the nearest I-frame can be used if other frames are not available.
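The decision rule can be sketched as follows (non-normative); f is the list of frame-change-values and N the number of display cycles.

def smart_quick_view_frames(f, n_display):
    """Map display cycles j = 0..N-1 to original frame numbers.

    f:         frame-change-values f(i), one per original frame
    n_display: total number of display cycles N
    """
    total = sum(f) or 1.0
    w = [v / total for v in f]        # normalized frame-change-values w(i)
    frames, acc, i = [], 0.0, 0
    for j in range(n_display):
        # accumulate w(i) until the sum exceeds j/N
        while i < len(w) - 1 and acc < j / n_display:
            acc += w[i]
            i += 1
        frames.append(i)              # frame displayed at display cycle j
    return frames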

Figure 33: Illustration of smart quick view.

Another functionality of the SequentialSummary DS is to visualize a video containing various regions-of-interest. Such regions-of-interest may be zoomed into or highlighted by the viewer application. Alternatively, such regions-of-interest may be used to synthesize images composed of thumbnails with different resolutions. In the latter case, the thumbnail images are given by a sequence of still images of arbitrary size. The size of each thumbnail image and its location in the original frame are chosen according to the video content. Note that more than one thumbnail image can correspond to the same original video frame. Thumbnail images corresponding to different regions of the same frame are layered to synthesize a single frame for visualization (see Figure 34).

Figure 34: Synthesizing frames in a video skim from multiple regions-of-interest.

Figure 5: Flowchart of "smart quick view"


12.1.8 SoundProperty DS
The SoundProperty DS is used to specify certain properties associated with an audio clip or the components of an audio slide show. These properties may specify the location of the audio clip, as a separate file and/or possibly by referring to the original AV content.

12.1.8.1 Description Scheme Syntax

<!-- ################################################ -->
<!-- Definition of SoundProperty DS                   -->
<!-- ################################################ -->

<complexType name="SoundProperty">
  <sequence>
    <element name="Title" type="mds:TextualDescription" minOccurs="0"/>
    <element name="RefLocator" type="mds:MediaLocator" minOccurs="0"/>
    <element name="RefTime" type="mds:MediaTime" minOccurs="0"/>
    <element name="SyncTime" type="mds:MediaTime" minOccurs="0"/>
    <element name="SoundLocator" type="mds:SoundLocator" minOccurs="0"/>
  </sequence>
</complexType>

12.1.8.2 Description Scheme Semantics

Name           Definition
SoundProperty  Specifies certain properties associated with an audio clip or the components of an audio slide show.
Title          The (textual) title of the audio content. This title describes the content of the audio clip. Typically it specifies the textual title of a song.
RefLocator     Specifies the location of the original audio content containing a particular audio clip or audio slide. Typically it locates a song. Contains a URL to the media and possibly a time-stamp locating the audio content within the media. See section 5.4.
RefTime        Specifies time-stamps (start time and duration) of the audio slide component in the original audio content.
SyncTime       Specifies time-stamps (start time and duration) of the audio clip in the audio-visual summary. It is assumed that the time-stamps in the summary, that is, the SyncTime elements, do not overlap in time.
SoundLocator   Specifies the location of a summary audio clip. May contain a URL to a separate file; otherwise, locates an audio clip in the composite audio of the parent SequentialSummary. See section 5.4.6.

12.1.8.3 Description Extraction
Audio clips that are components of an audio slide can be selected manually according to the relative importance of their contents.

12.1.8.4 Description Example
There may be several types of audio summaries, depending on whether the original content is stored in a single stream or file, or in multiple streams or files. Also, each audio slide in the summary may be either: a) part of the original content, b) part of a composite summary, or c) in a separate stream or file.

The following figure illustrates the case when there is a single source, but each audio slide component is located in a separate audio clip file.

Scene                Scene time in moz-req.aif   Excerpt time        Summary clip file
INTROITUS: Requiem   00:00:00-00:05:01           00:00:00-00:00:47   Requiem.aif
KYRIE                00:05:01-00:07:41           00:05:01-00:05:31   Kyrie.aif


In this case, the top-level RefLocator associated with the entire summary indicates the location of a single source. RefLocator elements associated with each slide specify the location of each audio scene (e.g. a song, a movement) using MediaTime. RefTime indicates the time-stamps of an audio slide component, and SoundLocator locates a separate audio file using MediaURL.

<SequentialSummary name="Mozart's Requiem KV 626"> <RefLocator idref="file://Mozart/moz-req.aif"/> <SoundProperty> <Title xml:lang="de">INTROITUS: Requiem</Title> <RefLocator> <MediaTime> <MediaTimePoint><m>0</m><s>0</s></MediaTimePoint> <MediaDuration><m>5</m><s>1</s></MediaDuration> </MediaTime> </RefLocator> <RefTime> <MediaTimePoint><m>0</m><s>0</s></MediaTimePoint> <MediaDuration><s>47</s></MediaDuration> </RefTime> <SoundLocator> <MediaURL>file://Mozart/Requiem.aif</MediaURL> </SoundLocator> </SoundProperty> <SoundProperty> <Title xml:lang="de">KYRIE</Title> <RefLocator> <MediaTime> <MediaTimePoint><m>5</m><s>1</s></MediaTimePoint> <MediaDuration><m>2</m><s>40</s></MediaDuration> </MediaTime> </RefLocator> <RefTime> <MediaTimePoint><m>5</m><s>1</s></MediaTimePoint> <MediaDuration><s>30</s></MediaDuration> </RefTime> <SoundLocator> <MediaURL>file://Mozart/Kyrie.aif</MediaURL> </SoundLocator> </SoundProperty> ...</SequentialSummary>

The following figure illustrates the case when there are multiple sources, and audio slide components are all part of a composite summary clip.


In this case, the AudioSegmentLocator element locates a composite audio summary. The RefLocator specifies the location of each source using MediaURL. RefTime refers to an audio clip, which is a slide component, within the source, and SoundLocator indicates a time-stamp of an audio clip within the composite audio summary using MediaTime.

<SequentialSummary name="Two Ton Shoe Rock Album"><AudioSegmentLocator> <MediaURL>file://TwoTonShoe/Summary/TWOTON.aif</MediaURL></AudioSegmentLocator>

<SoundProperty> <Title xml:lang="en">Brothers</Title> <SourceLocator> <MediaURL>file://TwoTonShoe/TWOTON-1.aif</MediaURL> </SourceLocator> <RefTime> <MediaTimePoint><m>1</m><s>6</s></MediaTimePoint> <MediaDuration><s>9</s></MediaDuration> </RefTime> <SoundLocator> <MediaTime> <MediaTimePoint><m>0</m><s>0</s></MediaTimePoint> <MediaDuration><s>9</s></MediaDuration> </MediaTime> </SoundLocator></SoundProperty>

<SoundProperty> <Title xml:lang="en">Georgie</Title> <SourceLocator> <MediaURL>file://TwoTonShoe/TWOTON-2.aif</MediaURL> </SourceLocator> <RefTime> <MediaTimePoint><m>0</m><s>17</s></MediaTimePoint>_ <MediaDuration><s>8</s></MediaDuration> </RefTime> <SoundLocator> <MediaTime> <MediaTimePoint><m>0</m><s>10</s></MediaTimePoint> <MediaDuration><s>8</s></MediaDuration> </MediaTime> </SoundLocator></SoundProperty>...</SequentialSummary>

Editor's Note: The time related elements in these examples should be updated according to the latest Time/MediaTime DSs.


12.1.8.5 Description Use
The SoundProperty DS can be used to provide an audio slide show, and supports the description of multiple contents, such as the introductions or themes of the multiple songs recorded on one CD album. In addition to normally playing an audio slide, some applications can use the RefLocator element to switch between an audio slide component and its original. For example, a user can listen to pieces of songs in the form of an audio slide show described by the SoundProperty DS; upon encountering a piece of a favorite song, the user can then listen to the original. SyncTime can be used to specify time-stamps (start time and duration) of the audio clip in the audio-visual summary. If SyncTime is not used in the SoundProperty DS, each component described by the SoundProperty DS should be played in order of appearance.

12.1.9 TextProperty DS
The TextProperty DS is used to specify certain properties of textual information associated with an audio-visual summary (slide show, audio slide show, audio-visual slide show, or audio-visual skim). For example, it can specify a textual table of contents of the audio-visual summary, or any text that compactly describes its content fully or partially. These properties may specify: the string; the start and end time in the summary associated with that string; and possibly also references to the time in the original AV content.

12.1.9.1 Description Scheme Syntax

<!-- ################################################ -->
<!-- Definition of TextProperty DS                    -->
<!-- ################################################ -->

<complexType name="TextProperty">
  <sequence>
    <choice>
      <element name="RefLocator" type="mds:MediaLocator" minOccurs="0"/>
      <element name="RefTime" type="mds:MediaTime" minOccurs="0"/>
    </choice>
    <element name="SyncTime" type="mds:MediaTime" minOccurs="0"/>
    <element name="FreeText" type="mds:TextualDescription" minOccurs="0"/>
  </sequence>
</complexType>

12.1.9.2 Description Scheme Semantics

Name          Definition
TextProperty  Specifies properties of textual information associated with an audio-visual summary.
RefLocator    Specifies the location of the original AV program and/or time-stamps (start time and duration) of the text associated with the original AV program. This is used if multiple AV programs are referenced in a summary.
RefTime       Specifies a time reference to the original AV program. It specifies time-stamps (start time and duration) of the textual information in the original AV program.
SyncTime      Specifies time-stamps (start time and end time, or start time and duration) of the textual information in the audio-visual summary.
FreeText      Specifies a free-text string containing compact information on the audio-visual summary.

12.1.9.3 Description Extraction
For the examples here, basically three technologies are used:
1. Shot boundary detection.
2. Time-scale modification, to speed up or slow down the audio by a factor, say N, while preserving pitch and timbre. Hence the duration of the modified audio is N*d0, where d0 = duration of the original audio.
3. Speech recognition and information retrieval technology, for matching query terms to the speech transcript and for automatic labeling (topics) of video segments.

We denote by moving storyboard (MSB) a video summary that is constructed as follows:
1. Apply shot boundary detection to the original video track.
2. Extract, say, K key frames per shot and record the duration of every shot. For the example below K=1, i.e., one key frame per shot.
3. The video track of the MSB summary has as many key frames as there are shots, and the duration (intended display time) of every key frame is that of the corresponding shot.
A non-normative sketch of this construction follows.
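The sketch below assumes the shot boundaries have already been detected; the record layout is illustrative.

def moving_storyboard(shot_starts, num_frames, k=1):
    """Construct a moving storyboard (MSB).

    shot_starts: sorted first-frame indices of the detected shots (starting at 0)
    num_frames:  total number of frames in the original video track
    k:           key frames extracted per shot (K=1 in the example)
    """
    edges = list(shot_starts) + [num_frames]
    msb = []
    for s in range(len(edges) - 1):
        start, end = edges[s], edges[s + 1]
        step = max(1, (end - start) // k)
        msb.append({
            'keyframes': list(range(start, end, step))[:k],  # K frames per shot
            'duration': end - start,  # intended display time of the key frame(s)
        })
    return msb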

12.1.9.4 Description Example
The following example shows the use of textual annotation to describe a table of contents, or the topics associated with an audiovisual summary. For this example, the start and end times of the shots and their associated textual information in the summary video are:

Topic              Start Time   End Time
"Introduction"     00:00:00     00:01:00 (1 min)
"Business Model"   00:01:01     00:03:00 (2 min)
"Stock Options"    00:03:01     00:05:00 (2 min)
"Get Rich Quick"   00:05:01     00:06:00 (1 min)

<SequentialSummary name="MSB4"> <!-- FastAudio_with_Table of Content -->

<AudioSegmentLocator> <MediaURL>http://CueVideo/classAudio.mp3</MediaURL> <MediaTime><Duration><m>12</m><s>0</s></Duration> </AudioSegmentLocator>

<FrameProperty> <!- Duration of Frame 1 in summary video = 1 min -> <SyncTime> <Duration><m>1</m><s>0</s><nn>0</nn></Duration> </SyncTime> <ImageLocator> <MediaURL>http://CueVideo/speaker1.jpg</MediaURL> </ImageLocator> </FrameProperty> <FrameProperty> <!- Duration of Frame 2 in summary video = 6 min -> <SyncTime> <Duration><m>3</m><s>0</s><nn>0</nn></Duration> </SyncTime> <ImageLocator> <MediaURL>http://CueVideo/speaker2.jpg</MediaURL> </ImageLocator> </FrameProperty> <FrameProperty> <!- Duration of Frame 3 in summary video = 4 min -> <SyncTime> <Duration><m>2</m><s>0</s><nn>0</nn></Duration> </SyncTime> <ImageLocator> <MediaURL>http://CueVideo/speaker3.jpg</MediaURL> </ImageLocator> </FrameProperty>

<TextLocator> <SyncTime> <RelTime><s>0</s><nn>1</nn></RelTime> <Duration><m>1</m></Duration> </SyncTime> <FreeText xml:lang="en"> Introduction </FreeText> </TextLocator> <TextLocator> <SyncTime> <RelTime><s>1</s><nn>1</nn></RelTime> <Duration><m>2</m></Duration>

Page 153: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

</SyncTime> <FreeText xml:lang="en"> Business Model </FreeText> </TextLocator> <TextLocator> <SyncTime> <RelTime><s>3</s><nn>1</nn></RelTime> <Duration><m>2</m></Duration> </SyncTime> <FreeText xml:lang="en"> Stock Options </FreeText> </TextLocator> <TextLocator> <SyncTime> <RelTime><s>5</s><nn>1</nn></RelTime> <Duration><m>1</m></Duration> </SyncTime> <FreeText xml:lang="en"> Get Rich Quick </FreeText> </TextLocator></SequentialSummary>

Editor's note: this example should be updated with respect to Time DS elements.

12.1.9.5 Description Use
The TextProperty DS can be used to synchronize textual annotation with audiovisual summaries.

Editor's note: this section needs an update.

12.2 Partitions and decompositions
This section describes space- and frequency-views, which provide a way to specify partitions of the audio-visual data in the space or frequency domain.

12.2.1 View DS
The View DS is an abstract DS that specifies a view of AV data. The View DS provides a base class for other specific types of views, such as Space Views, Frequency Views, Resolution Views, Space Resolution Views, and Space Frequency Views. The different types of Views rely on a Partition DS to specify the parameters of the partition in multi-dimensional space and/or frequency. The view itself can be divided into regions if these belong to different partitions in the multi-dimensional space and/or frequency domain.

12.2.1.1 View Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.1.2 View Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.1.3 View Extraction
N.A.

12.2.1.4 View Examples
The View DS is abstract. However, for example, the subDSs of the View DS allow specification of low-resolution thumbnail versions of video, different resolution views of different spatial regions of images, or wavelet subbands of audio.

12.2.1.5 View Use
Space and frequency views can be used in applications that involve the access and navigation of large images and video at multiple resolutions. For example, browsing applications for large aerial images and maps involve interactive zooming-in, zooming-out and panning around the 2-D image data. Typically, each operation requires the extraction, delivery and/or synthesis, and display of a space and frequency view of the large image. A similar framework applies to the progressive delivery of video at multiple spatial and temporal resolutions.

12.2.2 Space View DS
A Space View specifies a multi-dimensional spatial view of data, which corresponds to a partition in multi-dimensional space.

12.2.2.1 SpaceView Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.2.2 SpaceView Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.2.3 SpaceView Extraction
A Space View can be extracted from the audio-visual data by sub-setting or extracting partitions of the data.

12.2.2.4 SpaceView Examples
Examples of Space Views include regions of still images, temporal segments of video and temporal segments of audio. Figure 35.a) shows a large aerial image of which only a small subset depicting a rocky area is of interest to the user, see Figure 35.b). The Space View DS is used to describe this spatial view of the large 2-D image; see the description below.


Figure 35: (a) Source aerial image LB_120.tif, and (b) a part of image (a), described by a Space View DS.

The SpaceView description specifies the location of the source and the view image using Media Locators and specifies the view using the coordinates of the source.

<SpaceView>
  <Viewdata>
    <MediaURL>Rocky_of_LB_120.tif</MediaURL>
  </Viewdata>
  <Sourcedata>
    <MediaURL>LB_120.tif</MediaURL>
  </Sourcedata>
  <SpacePartition dimensions="2" units="samples">
    <Start size="2">
      <ValueVectorR> 20 </ValueVectorR>
      <ValueVectorR> 40 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 100 </ValueVectorR>
      <ValueVectorR> 100 </ValueVectorR>
    </End>
  </SpacePartition>
</SpaceView>

12.2.2.5 SpaceView Use
Space views can be used in applications that involve the access and navigation of sub-sets of the data of large images and video.

12.2.3 Frequency View DS
A FrequencyView specifies a multi-dimensional view of data in frequency, which corresponds to a partition in the multi-dimensional frequency plane.

12.2.3.1 FrequencyView Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.3.2 FrequencyView Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.3.3 FrequencyView Extraction
A Frequency View can be extracted from the audio-visual data using a filter that in the Fourier domain corresponds to a partition in the frequency plane.

12.2.3.4 FrequencyView Examples

In general, examples of Frequency Views include spatial-frequency subbands of still images, 3-D wavelet subbands of video and temporal-frequency subbands of audio. The sample below illustrates a frequency view of an aerial image, which corresponds to a spatial-frequency subband of the image.

Figure 36: Frequency View of an Aerial image – spatial-frequency subband.

The FrequencyView description specifies the location of the source and the view image using Media Locators and specifies the view using the coordinates of the source.

<FrequencyView>
  <ViewData>
    <MediaURL>Aerial_lh.jpg</MediaURL>
  </ViewData>
  <SourceData>
    <MediaURL>Aerial.jpg</MediaURL>
  </SourceData>
  <FrequencyPartition dimensions="2" units="fraction">
    <Start size="2">
      <ValueVectorR> 0 </ValueVectorR>
      <ValueVectorR> 0.5 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 0.5 </ValueVectorR>
      <ValueVectorR> 1 </ValueVectorR>
    </End>
  </FrequencyPartition>
</FrequencyView>

Editor's note: The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR).

12.2.3.5 FrequencyView Use
Frequency views can be used in applications that involve the access and navigation of large images and video at multiple resolutions.

12.2.4 SpaceFrequency View DS

A Space Frequency View specifies a multi-dimensional view of data simultaneously in space and frequency, which corresponds to partitions in the multi-dimensional space and frequency planes.

12.2.4.1 SpaceFrequencyView Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.4.2 SpaceFrequencyView Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.4.3 SpaceFrequencyView Extraction
A Space Frequency View can be extracted from the audio-visual data by sub-setting the data and by using a filter that in the Fourier domain corresponds to a partition of the sub-set in the frequency plane.

12.2.4.4 SpaceFrequencyView Examples
Examples of Space Frequency Views include spatial-frequency subbands of regions of still images, 3-D wavelet subbands of temporal segments of video and temporal-frequency subbands of temporal segments of audio.

An example of a Space Frequency View is given in Figure 37. The data of the aerial image is shown in one spatial region with all frequency components; in the other regions, only the low-band components are transmitted, to provide some context information.

Figure 37: Example SpaceFrequency view of Figure 35, using a high resolution for the region of interest and a reduced resolution for the context.

The SpaceFrequencyView description below describes the view in Figure 37. The first part describes the filtered outer area; the second part describes the inner region of the image. In both cases, the specification in space is done using the local pixel coordinates (samples).

<SpaceFrequencyView>
  <ViewData>
    <MediaURL>LB120_RI.tif</MediaURL>
  </ViewData>
  <SourceData>
    <MediaURL>LB120_4.tif</MediaURL>
  </SourceData>
  <FrequencyPartition dimensions="2" units="samples">
    <Start size="2">
      <ValueVectorR> 0 </ValueVectorR>
      <ValueVectorR> 0 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 64 </ValueVectorR>
      <ValueVectorR> 64 </ValueVectorR>
    </End>
  </FrequencyPartition>
  <SpacePartition dimensions="2" units="samples">
    <Start size="2">
      <ValueVectorR> 0 </ValueVectorR>
      <ValueVectorR> 0 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 512 </ValueVectorR>
      <ValueVectorR> 512 </ValueVectorR>
    </End>
  </SpacePartition>
</SpaceFrequencyView>

<SpaceFrequencyView>
  <Filter dimensions="2">
    <1DFilter dimension="1" size="3" leadin="1">
      <ValueVectorR> 0.25 </ValueVectorR>
      <ValueVectorR> 0.50 </ValueVectorR>
      <ValueVectorR> 0.25 </ValueVectorR>
    </1DFilter>
    <1DFilter dimension="2" size="3" leadin="1">
      <ValueVectorR> 0.25 </ValueVectorR>
      <ValueVectorR> 0.50 </ValueVectorR>
      <ValueVectorR> 0.25 </ValueVectorR>
    </1DFilter>
  </Filter>
  <SpacePartition dimensions="2" units="samples">
    <Start size="2">
      <ValueVectorR> 20 </ValueVectorR>
      <ValueVectorR> 40 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 100 </ValueVectorR>
      <ValueVectorR> 100 </ValueVectorR>
    </End>
  </SpacePartition>
</SpaceFrequencyView>

Editor's note: The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR).

12.2.4.5 SpaceFrequencyView Use
Space Frequency views can be used in applications that involve the access and navigation of sub-sets of large images and video at multiple resolutions.


12.2.5 Resolution View DS
A ResolutionView specifies a multi-dimensional view of data, which corresponds to a low-frequency region of the frequency plane.

12.2.5.1 ResolutionView Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.5.2 ResolutionView Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.5.3 ResolutionView Extraction
A Resolution View can be extracted from the audio-visual data by sub-setting the data and by using a low-pass filter that in the Fourier domain corresponds to a low-frequency partition of the sub-set in the frequency plane.

12.2.5.4 ResolutionView Examples
Examples of Resolution Views include subsampled spatial-frequency subbands of regions of still images, 3-D wavelet subbands of temporal segments of video and temporal-frequency subbands of temporal segments of audio.
Often it is only necessary to access an overview or low-resolution version of a large image. In this case, it is sufficient to look at the representation of the image generated by its low-frequency components. For this purpose, the view can be described by a ResolutionView DS or, more generally, by a FrequencyView DS. The ResolutionView DS describes the reduction of the image using low-pass filtering. For example, Figure 38 shows a low-resolution view of a large aerial image.

Figure 38: Example view of image with reduced resolution

The instantiation of the ResolutionView DS specifies the spatial frequency spectrum of the view image by specifying the filter applied to the source data. In this case, a separable filter f(k) = [0.25, 0.5, 0.25] is used for the horizontal and vertical directions. Additionally, padding options are specified for the filtering. The resolution vector [0.50, 0.50] specifies that the view has one-half resolution in the horizontal and vertical directions.

<ResolutionView>
  <Viewdata>
    <MediaURL>LB120-LP.tif</MediaURL>
  </Viewdata>
  <Sourcedata>
    <MediaURL>LB120.tif</MediaURL>
  </Sourcedata>
  <Resolution size="2">
    <ValueVectorR> 0.50 </ValueVectorR>
    <ValueVectorR> 0.50 </ValueVectorR>
  </Resolution>
  <Filter dimensions="2">
    <1DFilter dimension="1" leadin="1" size="3">
      <ValueVectorR> 0.25 </ValueVectorR>
      <ValueVectorR> 0.50 </ValueVectorR>
      <ValueVectorR> 0.25 </ValueVectorR>
    </1DFilter>
    <1DFilter dimension="2" leadin="1" size="3">
      <ValueVectorR> 0.25 </ValueVectorR>
      <ValueVectorR> 0.50 </ValueVectorR>
      <ValueVectorR> 0.25 </ValueVectorR>
    </1DFilter>
  </Filter>
</ResolutionView>
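As a non-normative illustration, the following Python sketch computes such a half-resolution view with the separable filter [0.25, 0.5, 0.25]; edge padding stands in for the padding options mentioned above, and a grayscale 2-D array is assumed.

import numpy as np

def resolution_view(image, f=(0.25, 0.5, 0.25)):
    """Half-resolution view: separable low-pass filtering plus 2:1 subsampling.

    image: 2-D array holding a grayscale image
    f:     the 3-tap filter applied along both dimensions
    """
    f = np.asarray(f, dtype=float)
    padded = np.pad(np.asarray(image, dtype=float), 1, mode='edge')  # padding option
    # filter along the horizontal direction, then the vertical direction
    rows = f[0] * padded[:, :-2] + f[1] * padded[:, 1:-1] + f[2] * padded[:, 2:]
    view = f[0] * rows[:-2, :] + f[1] * rows[1:-1, :] + f[2] * rows[2:, :]
    return view[::2, ::2]   # corresponds to the resolution vector [0.50, 0.50]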

Editor's note: The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR).

12.2.5.5 ResolutionView Use
Resolution Views can be used in applications that involve the access and navigation of low- or multiple-resolution views of images, audio and video.

12.2.6 SpaceResolution View DS
A SpaceResolution View specifies a multi-dimensional view of data, which corresponds to a partition in multi-dimensional space combined with a low-frequency region of the frequency plane.

12.2.6.1 SpaceResolutionView Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.6.2 SpaceResolutionView Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.6.3 SpaceResolutionView Extraction

A Space Resolution View can be extracted from the audio-visual data by sub-setting the data, by using a filter that in the Fourier domain corresponds to a partition of the sub-set in the frequency plane, and by using subsampling mechanisms to achieve a specified resolution.
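For illustration, a minimal non-normative Python sketch of this extraction is given below; it crops the region in sample units and then subsamples, and for brevity omits the anti-aliasing low-pass filter that a practical extractor would apply first (see 12.2.5.3). All names are assumptions of the sketch:

import numpy as np

def space_resolution_view(image, start, end, resolution=(0.5, 0.5)):
    """Crop the region [start, end) in samples, then subsample to the target resolution."""
    region = image[start[0]:end[0], start[1]:end[1]]
    step = (round(1 / resolution[0]), round(1 / resolution[1]))
    return region[::step[0], ::step[1]]  # e.g. every 2nd sample for resolution 0.5

# e.g. the rocky area of the aerial image: start=(20, 40), end=(100, 100), resolution (0.5, 0.5)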

12.2.6.4 SpaceResolutionView Examples

Figure 39(a) shows a large aerial image of which only a small subset depicting a rocky area is of interest to the user; see Figure 39(b). The SpaceResolutionView DS is used to describe this view of the large 2-D image, as in the description below.


Figure 39: (a) Source aerial image LB_120.tif, and (b) a part of image (a) described by a SpaceResolutionView DS.

The SpaceResolutionView description specifies the locations of the source and the view image using Media Locators. Furthermore, the region of the source data composing the view and its resolution (0.5, 0.5) are specified.

<SpaceResolutionView>
  <Viewdata>
    <MediaURL>Rocky_of_LB_120.tif</MediaURL>
  </Viewdata>
  <Sourcedata>
    <MediaURL>LB_120.tif</MediaURL>
  </Sourcedata>
  <SpacePartition dimensions="2" unit="samples">
    <Start size="2">
      <ValueVectorR> 20 </ValueVectorR>
      <ValueVectorR> 40 </ValueVectorR>
    </Start>
    <End size="2">
      <ValueVectorR> 100 </ValueVectorR>
      <ValueVectorR> 100 </ValueVectorR>
    </End>
  </SpacePartition>
  <Resolution size="2">
    <ValueVectorR> 0.50 </ValueVectorR>
    <ValueVectorR> 0.50 </ValueVectorR>
  </Resolution>
</SpaceResolutionView>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.6.5 SpaceResolutionView Use

Space Resolution Views can be used in applications that involve the access and navigation of sub-sets of large images and video at multiple resolutions.

12.2.7 Filter DS

The Filter DS specifies a multi-dimensional filter to characterize the loss in information of a signal to which the filter is applied.

12.2.7.1 Filter Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.7.2 Filter Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.7.3 Filter Extraction

N.A.

12.2.7.4 Filter Examples

The following description of a multidimensional filter specifies a two-dimensional separable filter composed of the one-dimensional filter f[k] = {0.25, 0.5, 0.25}, where f[0] = 0.5, applied in each dimension.

<Filter dimensions="2">
  <1DFilter dimension="1" leadin="1" size="3">
    <ValueVectorR> 0.25 </ValueVectorR>
    <ValueVectorR> 0.50 </ValueVectorR>
    <ValueVectorR> 0.25 </ValueVectorR>
  </1DFilter>
  <1DFilter dimension="2" leadin="1" size="3">
    <ValueVectorR> 0.25 </ValueVectorR>
    <ValueVectorR> 0.50 </ValueVectorR>
    <ValueVectorR> 0.25 </ValueVectorR>
  </1DFilter>
</Filter>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)


12.2.7.5 Filter Use

Especially in the domain of scientific signal analysis, it is necessary to describe the filters used to modify signal data in order to specify the loss of information.

12.2.8 1D/2D-Filter DS

The 1D/2D Filter DS specifies a one-dimensional or two-dimensional filter (which has to be non-separable) to characterize the loss in information of a signal to which the filter is applied.

12.2.8.1 1D/2D-Filter Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.8.2 1D/2D-Filter Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.8.3 1D/2D-Filter Extraction

N.A.

12.2.8.4 1D/2D-Filter Examples

<1DFilter dimension="1" leadin="1" size="3">
  <ValueVectorR> 0.25 </ValueVectorR>
  <ValueVectorR> 0.50 </ValueVectorR>
  <ValueVectorR> 0.25 </ValueVectorR>
</1DFilter>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.8.5 1D/2D-Filter Use

Especially in the domain of scientific signal analysis, it is necessary to describe the filters used to modify signal data in order to specify the loss of information.

12.2.9 ViewSet DS

A View Set specifies a set of views. The View Set is complete if it completely covers the space and frequency planes, and incomplete if it does not. The View Set is nonredundant if the views do not overlap in the space and frequency planes, and redundant if they do.

12.2.9.1 ViewSet Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.9.2 ViewSet Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.9.3 ViewSet Extraction

A View Set can be extracted from the audio-visual data by extracting one or more views to form the set and by optionally evaluating the completeness and nonredundancy based on the space and frequency partition information of the set of views.
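The optional evaluation can be illustrated with the following non-normative Python sketch for 2-D views described by fractional (start, extent) partitions over [0, 1]^2; the helper names, the 2-D area check and the tolerance are assumptions of the sketch:

from itertools import combinations

def overlaps(a, b):
    # True if two axis-aligned partitions (start, extent) intersect with nonzero measure.
    return all(a[0][d] < b[0][d] + b[1][d] and b[0][d] < a[0][d] + a[1][d]
               for d in range(len(a[0])))

def evaluate_view_set(partitions):
    # partitions: list of (start, extent) pairs in fractional units over [0,1]^2.
    nonredundant = not any(overlaps(a, b) for a, b in combinations(partitions, 2))
    covered = sum(e[0] * e[1] for _, e in partitions)   # total 2-D area of the views
    complete = nonredundant and abs(covered - 1.0) < 1e-9
    return complete, nonredundant

# The four subbands of Figure 40: complete and nonredundant.
subbands = [((0.0, 0.0), (0.5, 0.5)), ((0.0, 0.5), (0.5, 0.5)),
            ((0.5, 0.0), (0.5, 0.5)), ((0.5, 0.5), (0.5, 0.5))]
print(evaluate_view_set(subbands))   # (True, True)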

12.2.9.4 ViewSet Examples

Examples of View Sets include a set of spatial-frequency subbands of an image and a set of layers of scalable video data.


Figure 40: Example View Set with a set of Frequency Views that are image subbands. This View Set is complete and nonredundant.

The following description specifies a View Set of Frequency Views that is complete and nonredundant:

<ViewSet complete="true" nonredundant="true">
  <ViewElement xsi:type="FrequencyView">
    <Viewdata> <MediaURL> aerial-00.jpg </MediaURL> </Viewdata>
    <Sourcedata> <MediaURL> aerial.jpg </MediaURL> </Sourcedata>
    <SpacePartition dimensions="2">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </SpacePartition>
  </ViewElement>
  <ViewElement xsi:type="FrequencyView">
    <Viewdata> <MediaURL> aerial-01.jpg </MediaURL> </Viewdata>
    <Sourcedata> <MediaURL> aerial.jpg </MediaURL> </Sourcedata>
    <FrequencyPartition dimensions="2">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </FrequencyPartition>
  </ViewElement>
  <ViewElement xsi:type="FrequencyView">
    <Viewdata> <MediaURL> aerial-10.jpg </MediaURL> </Viewdata>
    <Sourcedata> <MediaURL> aerial.jpg </MediaURL> </Sourcedata>
    <FrequencyPartition dimensions="2">
      <Start size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </FrequencyPartition>
  </ViewElement>
  <ViewElement xsi:type="FrequencyView">
    <Viewdata> <MediaURL> aerial-11.jpg </MediaURL> </Viewdata>
    <Sourcedata> <MediaURL> aerial.jpg </MediaURL> </Sourcedata>
    <FrequencyPartition dimensions="2">
      <Start size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </FrequencyPartition>
  </ViewElement>
</ViewSet>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.9.5 ViewSet Use

View Sets can be used in applications that involve the access and navigation of multiple views of images, audio and video.

12.2.10 SpaceTree DS

A Space Tree specifies a tree of Space Views.

12.2.10.1 SpaceTree Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.10.2 SpaceTree Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


12.2.10.3 SpaceTree Extraction

A Space Tree can be extracted, for example, from an image using a spatial quad-tree segmentation.
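A minimal, non-normative sketch of such a quad-tree segmentation in Python follows; the node representation (nested dictionaries holding (start, extent) partitions in samples) is an assumption of the sketch:

def space_tree(x0, y0, width, height, depth):
    # One node: a (start, extent) partition in samples, plus four quadrant children.
    node = {"partition": ((x0, y0), (width, height)), "children": []}
    if depth > 0:
        hw, hh = width // 2, height // 2
        for dx, dy in ((0, 0), (hw, 0), (0, hh), (hw, hh)):
            node["children"].append(space_tree(x0 + dx, y0 + dy, hw, hh, depth - 1))
    return node

# A 256x256 image split to depth 2 gives the 128- and 64-sample tiles of the example below.
tree = space_tree(0, 0, 256, 256, depth=2)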

12.2.10.4 SpaceTree Examples

Examples of space trees include spatial quad-tree decompositions of images and hierarchical segmentations of images, video and audio. The following example describes a space tree decomposition of an image with a splitting or branching factor of 4.

<SpaceTree branching="4">
  <Index size="2"> <ValueVectorI> 0 </ValueVectorI> <ValueVectorI> 0 </ValueVectorI> </Index>
  <ViewElement>
    <ViewData> <MediaURL> aerialST.jpg </MediaURL> </ViewData>
    <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
    <SpacePartition dimensions="2" units="samples">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 256 </ValueVectorR> <ValueVectorR> 256 </ValueVectorR> </Extent>
    </SpacePartition>
  </ViewElement>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialST-01.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <SpacePartition dimensions="2" units="samples">
        <Start size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialST-10.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <SpacePartition dimensions="2" units="samples">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialST-11.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <SpacePartition dimensions="2" units="samples">
        <Start size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialST-00.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <SpacePartition dimensions="2" units="samples">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 128 </ValueVectorR> <ValueVectorR> 128 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialST-00-00.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <SpacePartition dimensions="2" units="samples">
          <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Extent>
        </SpacePartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialST-00-01.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <SpacePartition dimensions="2" units="samples">
          <Start size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Extent>
        </SpacePartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialST-00-10.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <SpacePartition dimensions="2" units="samples">
          <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Extent>
        </SpacePartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialST-00-11.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <SpacePartition dimensions="2" units="samples">
          <Start size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 64 </ValueVectorR> <ValueVectorR> 64 </ValueVectorR> </Extent>
        </SpacePartition>
      </ViewElement>
    </Child>
    ...
  </Child>
</SpaceTree>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.10.5 SpaceTree Use

A Space Tree can be used to describe a hierarchical decomposition of image, video or audio data in space or time. The Space Tree provides an organization of the Space Views of the data.

12.2.11 FrequencyTree DS

A Frequency Tree specifies a tree of Frequency Views.

12.2.11.1 FrequencyTree Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.11.2 FrequencyTree Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.11.3 FrequencyTree Extraction

A Frequency Tree can be generated by performing a frequency decomposition of the image, video or audio data.
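For illustration only, one analysis step of such a decomposition can be sketched in Python with an unnormalized Haar filter pair; even image dimensions and the subband names are assumptions of this sketch:

import numpy as np

def haar_subbands(image):
    """One 2-D Haar analysis step: returns four quarter-size subbands."""
    x = image.astype(float)
    a, b = x[:, 0::2], x[:, 1::2]                  # split columns into even/odd samples
    lo, hi = (a + b) / 2, (a - b) / 2              # horizontal low- and high-pass
    def vsplit(s):
        return (s[0::2, :] + s[1::2, :]) / 2, (s[0::2, :] - s[1::2, :]) / 2
    ll, lh = vsplit(lo)                            # vertical split of the low band
    hl, hh = vsplit(hi)                            # vertical split of the high band
    return ll, lh, hl, hh

Recursing on the low-low subband yields a wavelet tree, while recursing on all four subbands yields a wavelet-packet tree.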

12.2.11.4 FrequencyTree Examples

Examples of Frequency trees include wavelet or wavelet-packet tree decompositions of image, video and audio data. The following example describes a frequency tree decomposition of an image with a splitting or branching factor of 4.

<FrequencyTree branching="4">
  <Index size="2"> <ValueVectorI> 0 </ValueVectorI> <ValueVectorI> 0 </ValueVectorI> </Index>
  <ViewElement>
    <ViewData> <MediaURL> aerialFT.jpg </MediaURL> </ViewData>
    <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
    <FrequencyPartition dimensions="2" units="fraction">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 1.0 </ValueVectorR> <ValueVectorR> 1.0 </ValueVectorR> </Extent>
    </FrequencyPartition>
  </ViewElement>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialFT-01.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialFT-10.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialFT-11.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
  </Child>
  <Child>
    <ViewElement>
      <ViewData> <MediaURL> aerialFT-00.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialFT-00-00.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <FrequencyPartition dimensions="2" units="fraction">
          <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
        </FrequencyPartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialFT-00-01.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <FrequencyPartition dimensions="2" units="fraction">
          <Start size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
        </FrequencyPartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialFT-00-10.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <FrequencyPartition dimensions="2" units="fraction">
          <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
        </FrequencyPartition>
      </ViewElement>
    </Child>
    <Child>
      <ViewElement>
        <ViewData> <MediaURL> aerialFT-00-11.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <FrequencyPartition dimensions="2" units="fraction">
          <Start size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Start>
          <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
        </FrequencyPartition>
      </ViewElement>
    </Child>
    ...
  </Child>
</FrequencyTree>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.11.5 FrequencyTree Use

A Frequency Tree can be used to describe a hierarchical decomposition of image, video or audio data in spatial- or temporal-frequency. The Frequency Tree provides an organization of the Frequency Views of the data.

12.2.12 SpaceFrequencyGraph DS

A Space Frequency Graph specifies a decomposition in space and frequency.

12.2.12.1 SpaceFrequencyGraph Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


12.2.12.2 SpaceFrequencyGraph Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.12.3 SpaceFrequencyGraph Extraction

A Space Frequency Graph can be generated by performing a decomposition of the image, video or audio data in space and frequency.

12.2.12.4 SpaceFrequencyGraph Examples

Examples of Space Frequency Graphs include graph structures that combine wavelet-packet and spatial-tree decompositions of image, video and audio data.

Figure 41: Example of Space and Frequency Graph for 2-D images. The graph organizes views of images in space and frequency, such as low-resolution views, high-resolution spatial views, low-resolution spatial views, and frequency views.

<SpaceFrequencyGraph branching="2">
  <ViewElement>
    <ViewData> <MediaURL> aerialSFG-00.jpg </MediaURL> </ViewData>
    <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
    <FrequencyPartition dimensions="2" units="fraction">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </FrequencyPartition>
    <SpacePartition dimensions="2" units="fraction">
      <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="2"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
    </SpacePartition>
  </ViewElement>
  <SpaceChild branching="2">
    <ViewElement>
      <ViewData> <MediaURL> aerialSFG-01.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
      </FrequencyPartition>
      <SpacePartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
  </SpaceChild>
  <FrequencyChild branching="2">
    <ViewElement>
      <ViewData> <MediaURL> aerialSFG-01.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
      </FrequencyPartition>
      <SpacePartition dimensions="2" units="fraction">
        <Start size="2"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Extent>
      </SpacePartition>
    </ViewElement>
  </FrequencyChild>
  ...
</SpaceFrequencyGraph>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.12.5 SpaceFrequencyGraph Use

The Space Frequency Graph allows description and organization of views of image, video and audio data in space and frequency, which can be used for progressive retrieval and interactive navigation.

12.2.13 VideoViewGraph DS

A Video View Graph specifies a decomposition of video in spatial- and temporal-frequency.

12.2.13.1 VideoViewGraph Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.13.2 VideoViewGraph Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.13.3 VideoViewGraph Extraction

A Video View Graph can be generated by performing a frequency decomposition of video in space and time.

12.2.13.4 VideoViewGraph Examples

An example of a Video View Graph is the set of views generated using a 3-D wavelet decomposition of video.

Figure 42: Example of Video View Graph. (a) Basic spatial- and temporal-frequency decomposition building block; (b) example video view graph of depth three in spatial- and temporal-frequency.

<VideoViewGraph branching="2">
  <ViewElement>
    <ViewData> <MediaURL> soccer-00.mpg </MediaURL> </ViewData>
    <SourceData> <MediaURL> soccer.mpg </MediaURL> </SourceData>
    <FrequencyPartition dimensions="3" units="fraction">
      <Start size="3"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
      <Extent size="3"> <ValueVectorR> 1 </ValueVectorR> <ValueVectorR> 1 </ValueVectorR> <ValueVectorR> 1 </ValueVectorR> </Extent>
    </FrequencyPartition>
  </ViewElement>
  <SpaceChild branching="2">
    <ViewElement>
      <ViewData> <MediaURL> soccer-01.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> soccer.mpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="3" units="fraction">
        <Start size="3"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="3"> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> <ValueVectorR> 1 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
  </SpaceChild>
  <FrequencyChild branching="2">
    <ViewElement>
      <ViewData> <MediaURL> soccer-02.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> soccer.mpg </MediaURL> </SourceData>
      <FrequencyPartition dimensions="3" units="fraction">
        <Start size="3"> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> <ValueVectorR> 0 </ValueVectorR> </Start>
        <Extent size="3"> <ValueVectorR> 1 </ValueVectorR> <ValueVectorR> 1 </ValueVectorR> <ValueVectorR> 0.5 </ValueVectorR> </Extent>
      </FrequencyPartition>
    </ViewElement>
  </FrequencyChild>
  ...
</VideoViewGraph>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.13.5 VideoViewGraph Use

The Video View Graph can be used to provide multi-resolution access to video at varying spatial and temporal resolutions.

12.2.14 MultiResolutionPyramid DS

A Multiresolution Pyramid specifies a hierarchy of views of data.

12.2.14.1 MultiResolutionPyramid Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.14.2 MultiResolutionPyramid Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.2.14.3 MultiResolutionPyramid Extraction

A Multiresolution Pyramid may be generated by decomposing image, video or audio data into a hierarchy of views at different resolutions.
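A non-normative Python sketch of such a decomposition, a Gaussian-style pyramid built with a separable three-tap kernel (the kernel, edge padding and function name are assumptions of the sketch), is:

import numpy as np

def gaussian_pyramid(image, levels=3, kernel=(0.25, 0.5, 0.25)):
    """Successively low-pass filter and halve the image in each dimension."""
    views, current = [], image.astype(float)
    for _ in range(levels):
        padded = np.pad(current, 1, mode="edge")
        rows = sum(w * padded[:, i:i + current.shape[1]] for i, w in enumerate(kernel))
        smooth = sum(w * rows[i:i + current.shape[0], :] for i, w in enumerate(kernel))
        current = smooth[::2, ::2]
        views.append(current)  # resolutions 0.50, 0.25, 0.125, ... as in the example below
    return views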

12.2.14.4 MultiResolutionPyramid Examples

An example of a Multiresolution Pyramid is a Gaussian pyramid decomposition of an image.

<MultiResolutionPyramid level="0">
  <ViewElement>
    <ViewData> <MediaURL> aerial-0.jpg </MediaURL> </ViewData>
    <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
    <Resolution size="2"> <ValueVectorR> 0.50 </ValueVectorR> <ValueVectorR> 0.50 </ValueVectorR> </Resolution>
  </ViewElement>
  <Child level="1">
    <ViewElement>
      <ViewData> <MediaURL> aerial-0.jpg </MediaURL> </ViewData>
      <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
      <Resolution size="2"> <ValueVectorR> 0.25 </ValueVectorR> <ValueVectorR> 0.25 </ValueVectorR> </Resolution>
    </ViewElement>
    <Child level="2">
      <ViewElement>
        <ViewData> <MediaURL> aerial-0.jpg </MediaURL> </ViewData>
        <SourceData> <MediaURL> aerial.jpg </MediaURL> </SourceData>
        <Resolution size="2"> <ValueVectorR> 0.125 </ValueVectorR> <ValueVectorR> 0.125 </ValueVectorR> </Resolution>
      </ViewElement>
    </Child>
  </Child>
</MultiResolutionPyramid>

The examples should be updated to reflect the changes in the Schema definition (e.g. ValueVectorR)

12.2.14.5 MultiResolutionPyramid Use

The Multiresolution Pyramid is used for progressive retrieval of image, video and audio data.

12.3 Description of variation of the content

12.3.1 Variation DS

The Variation DS is used to specify variations of audio-visual data. The variations may, in general, be generated in a number of different ways, or reflect revisions of the original data. The quality of a variation compared to the original is given by a fidelity value. The type of variation is indicated by a variation type attribute. The different types of variations are described as follows:

Translation – translation involves the conversion from one modality (image, video, text, audio, synthetic model) to another. Examples of translation include text-to-speech (TTS) conversion, speech-to-text (speech recognition), video-to-image (video mosaicing), image-to-text (embedded caption recognition), and 3-D model rendering.

Summary – summarization involves the reduction of information detail. Examples of summaries include those defined in the Summary DS.

Scaling – scaling involves operations of data transcoding, manipulation and compression that result in reduction in size, quality and data rate. Examples of scaling include image, video, and audio transcoding, image size reduction, video frame dropping, color conversion and DCT coefficient scaling.

Extract – extraction involves the extraction of information from the input program. Examples of extraction include key-frame extraction from video, audio-band and voice extraction from audio, paragraph and key-term extraction from text, region, segment, object, and event extraction from audio and video.

Abstract – abstract refers to an overview of the input program in which the salient points are presented.

Substitute – substitution indicates that one program can be used to substitute for another. Examples of substitution include a text passage that replaces a photographic image when a photographic image cannot be handled by a terminal device, or an audio track that replaces a chart in a presentation.

Revision – revision indicates that the audio-visual program has been revised in some way, such as through editing or post-processing, to produce the variation.

12.3.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

12.3.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


12.3.1.3 Description Extraction

In some cases the variations are derived from the original audio-visual data by processing. For example, in the case of the variation type "summary", the variation may be computed from the source audio-visual program. The variation fidelity attribute gives the quality of the variation for replacing the original for purposes of delivery under different network conditions, user or publisher preferences, or capabilities of the client devices.

One procedure for computing the fidelity is based on the media attributes of the variation and of the source audio-visual data: with A denoting the original data and B the variation of the data, the fidelity is computed by comparing the media attributes of B with those of A.

A second procedure for computing fidelity, used when the variation of a video consists only of images (such as a storyboard), is based on the number of key-frames from the video that are used in the summary of the video.
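The normative fidelity computation is specified in the MDS WD; purely as an illustration of the key-frame-based procedure, a sketch could look as follows, where the simple ratio is an assumption of this sketch and not the normative formula:

def keyframe_fidelity(num_keyframes_used, num_keyframes_total):
    """Illustrative only: fraction of the video's key-frames retained by the storyboard."""
    if num_keyframes_total == 0:
        return 0.0
    return num_keyframes_used / num_keyframes_total

print(keyframe_fidelity(12, 120))  # 0.1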


12.3.1.4 Description Example

The following examples give a number of different variations of a video program "news.mpg". Each variation indicates a locator for the source media, a locator for the variation media, the variation type and the fidelity of the variation with respect to the original.

Figure 43: Selection screen in which the user specifies the terminal device and network characteristics in terms of screen size, screen color, supported frame rate, bandwidth and supported modalities (image, video, audio).


Figure 44: Resulting selection of variations of a video news program under different terminal and network conditions. The high-rate color variation is selected for capable terminals, whereas the low-resolution grayscale variation is selected for more constrained terminals.

<Variations>
  <Variation Type="Substitution" VariationFidelity="0.639298">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.170033">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news_aud08.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.180219">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news_aud11.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.292167">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news_aud44.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.585110">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01gsa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.623944">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05gsa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.709808">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15gsa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.681386">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01sa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.725080">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05sa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.820937">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15sa.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.792269">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01a.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.848132">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05a.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.964959">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15a.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.238315">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01gs.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.282410">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05gs.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.377421">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15gs.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.334855">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01s.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.384775">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05s.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.490807">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news15s.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.446212">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news01.avi</MediaURL> </VariationMedia>
  </Variation>
  <Variation Type="Substitution" VariationFidelity="0.509720">
    <SourceMedia> <MediaURL>news.mpg</MediaURL> </SourceMedia> <VariationMedia> <MediaURL>news05.avi</MediaURL> </VariationMedia>
  </Variation>
</Variations>

12.3.1.5 Description Use

The Variation DS can be used for a number of purposes including Universal Multimedia Access (UMA). In UMA, the variations of the multimedia items can replace the original, if necessary, to adapt different multimedia presentations to the capabilities of the client terminals, network conditions or user preferences.
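As a non-normative illustration of such adaptation, a server could select among described variations as sketched below in Python; the media attributes (format, bitrate) and all values are hypothetical, not taken from the example description:

def select_variation(variations, supported_formats, max_bitrate):
    """Pick the highest-fidelity variation the client terminal can actually play."""
    feasible = [v for v in variations
                if v["format"] in supported_formats and v["bitrate"] <= max_bitrate]
    return max(feasible, key=lambda v: v["fidelity"], default=None)

# Hypothetical media attributes; an MPEG-7 description would supply such values.
variations = [
    {"url": "news15a.avi", "format": "video", "bitrate": 1500, "fidelity": 0.964959},
    {"url": "news15gs.avi", "format": "video", "bitrate": 400, "fidelity": 0.377421},
    {"url": "news_aud08.avi", "format": "audio", "bitrate": 64, "fidelity": 0.170033},
]
print(select_variation(variations, {"video", "audio"}, max_bitrate=500)["url"])
# prints "news15gs.avi": the best variation that fits the bandwidth constraint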


13 Organization of the content

13.1 Collections and classification schemes

13.1.1 Collection Structure DS

The Collection Structure DS provides the tools to describe collections of multimedia documents and collections of elements within the description of a single multimedia item (e.g. collection of segments).

13.1.1.1 Collection Structure Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.1.1.2 Collection Structure Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.1.1.3 Collection Structure Extraction

Descriptions of collections using the Collection Structure DS could be generated using a combination of a wide variety of methods, both automatic and manual. Manual classification could be used to generate collection clusters at the higher levels of a hierarchical structure. Then, the elements of manually generated collection clusters could be further grouped into other collection clusters using automatic clustering methods [13] based on low-level features such as the color histogram.

Let’s consider a collection of images. All the images in the collection could have been manually classified into two independent sets of semantic classes. The lists of semantic classes in a possible scenario are shown in Table 3 and Table 4. For each image in the collection, the color histogram could be extracted (see Visual XM document). Based on the color histogram values for each image, the following information could be obtained for each class:

Mean color histogram: If there are N images and ColorHistogramValue(i) represents the vector of the color histogram bins for image i (i = 1..N), the mean color histogram for the N images could be obtained as follows:

MeanColorHistogramValue = (1/N) * sum_{i=1..N} ColorHistogramValue(i)

Variance color histogram: If there are N images, ColorHistogramValue(i) represents the vector of the color histogram bins for image i (i = 1..N), and MeanColorHistogramValue represents the vector of the mean color histogram for the N images, the variance color histogram for the N images could be obtained as follows:

VarianceColorHistogramValue = (1/N) * sum_{i=1..N} (ColorHistogramValue(i) - MeanColorHistogramValue)^2
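These two computations can be sketched non-normatively in Python as follows, assuming the histograms are available as equal-length vectors (the function name is an assumption of the sketch):

import numpy as np

def cluster_statistics(histograms):
    """Mean and variance of the color-histogram vectors of the N images in a class."""
    h = np.asarray(histograms, dtype=float)      # shape (N, num_bins)
    mean = h.mean(axis=0)                        # (1/N) * sum_i h_i
    variance = ((h - mean) ** 2).mean(axis=0)    # (1/N) * sum_i (h_i - mean)^2
    return mean, variance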

Table 3: First classification of the images in the collection.

Classes: Flower Garden; Rock and sky; News anchor; Walking people; Baldheaded man walk/talking persons; Sports reporters in the rain; Congress; Baldheaded man representing; Castle; Black clothes lady on the blue mat; Singer with studio lights; Strange hair; Leather jacket people; Quiz Scene; Speaker; Man and horse; Space earth; Fountain; Graphics before news; Ron Reagan; Basketball game overlay; Glass roof; Snow clad mountain; Outdoor/boats; By the water; Couple; Shop; Man with placard; People on the red; Snake; Fish; Tapirs; Butterfly; Small monkey with banana; Landscape Image 1; Landscape Image 2; Landscape Image 3; Indoor Image; Anchorperson; Flower (indoor); Playing on the street; Road with trees/grass; Children/rock/grass; Asian building; Containers; Sunset over lake; Big pipes; Man with sunglasses in white shirt; Wooden shack; Ruins.

Table 4: Second classification of the images in the collection.

Classes: Beauty; Reporting; Politics; Entertainment; Growing up; Living; Happiness; Animal kingdom; Snow; River; Human achievement; Travel; Art; Desolation.

The process of instantiating the Collection Structure DS with this information for each class could be as follows. A root collection cluster could be defined to represent the entire collection of images. The two different classification schemes could be represented as two different cluster decompositions of semantic type from the root collection cluster. Each class in each classification scheme could then be represented as a collection cluster with the following attributes: (1) an Annotation DS containing the semantic label associated with the class; (2) the number of elements in the class; (3) a Cluster Statistics DS containing the mean and variance color histogram; (4) a Representative Icons DS providing the URL to the icon of the first elements in the class; and (5) a Cluster Creation DS specifying that a manual method was used to create the classes. Each collection cluster associated with a class could also contain a Cluster Decomposition DS with the images or elements of the class.

A section of a possible image collection description as described in this section is included in the next section.

13.1.1.4 Collection Structure Examples

See below for a section of a possible image collection description as described in the previous section.

<!-- ################################################ -->
<!-- Description of an image collection -->
<!-- ################################################ -->

<CollectionStructure>

<!-- ################################################ -->
<!-- Root collection cluster that represents the -->
<!-- entire image collection -->
<!-- ################################################ -->

<CollectionCluster id="CC_0"> <ClusterAttribute xsi:type="NumberElements"> value="387"/></ClusterAttribute>

<!-- ################################################ -->
<!-- First decomposition of root collection cluster -->
<!-- It describes the classification of the images -->
<!-- in the collection by the video group -->
<!-- ################################################ -->

<ClusterRelationship xsi:type="ClusterDecomposition" type="Structure Topological"
    name="Cluster Decomposition" degree="2" DecompositionType="Semantic"
    overlaps="false" gaps="false">

<!-- ################################################ -->
<!-- Collection cluster representing the "Flower -->
<!-- Garden" class defined by the video group -->
<!-- ################################################ -->

<CollectionCluster id="CC_0_0">

<!-- ################################################ -->
<!-- Semantic annotation for the class -->
<!-- ################################################ -->

<Annotation> <TextAnnotation>Flower Garden</TextAnnotation> </Annotation>

<!-- ################################################ -->
<!-- Meta attributes of class: cluster creation and -->
<!-- representative icons -->
<!-- ################################################ -->

<ClusterAttribute xsi:type="ClusterCreation"> <Method mode="Manual"/> </ClusterAttribute> <ClusterAttribute xsi:type="RepresentativeIcons"> <MediaLocator> <MediaURL>icons/0.jpg</MediaURL> </MediaLocator> </ClusterAttribute>

<!-- ################################################ -->
<!-- Number of elements in the class -->
<!-- ################################################ -->

<ClusterAttribute xsi:type="NumberElements"> value="4"/></ClusterAttribute>

<!-- ################################################ -->
<!-- Statistics for elements in the class: mean -->
<!-- and variance color histogram -->
<!-- ################################################ -->

<ClusterAttribute xsi:type="ClusterStatistics" DescriptorName="ColorHistogram"> <Mean Size="166"> <ValueVectorR>22068.75</ValueVectorR> <ValueVectorR>4919.0</ValueVectorR> <!— Other Value R Value --> </Mean> <Variance Size="168"> <ValueVectorR>1.28529886875E7</ValueVectorR> <ValueVectorR>1091589.5</ValueVectorR> <!— Other R Value --> </Variance> </ClusterAttribute>

<!-- ################################################ -->
<!-- Decomposition of class in its elements, i.e. -->
<!-- the images that belong to the class -->
<!-- ################################################ -->

<ClusterRelationship xsi:type="ClusterDecomposition" type="Structure Topological"
    name="Cluster Decomposition" degree="2" DecompositionType="Elements"
    overlaps="false" gaps="false">
  <ClusterNode>
    <GenericDS>
      <MediaInformation> <MediaProfile> <MediaInstance>
        <InstanceLocator> <MediaURL>imgs/img00587_add3.jpg</MediaURL> </InstanceLocator>
      </MediaInstance> </MediaProfile> </MediaInformation>
    </GenericDS>
    <GenericDS>
      <MediaInformation> <MediaProfile> <MediaInstance>
        <InstanceLocator> <MediaURL>imgs/img00585_add3.jpg</MediaURL> </InstanceLocator>
      </MediaInstance> </MediaProfile> </MediaInformation>
    </GenericDS>
    <GenericDS>
      <MediaInformation> <MediaProfile> <MediaInstance>
        <InstanceLocator> <MediaURL>imgs/img00586_add3.jpg</MediaURL> </InstanceLocator>
      </MediaInstance> </MediaProfile> </MediaInformation>
    </GenericDS>
    <GenericDS>
      <MediaInformation> <MediaProfile> <MediaInstance>
        <InstanceLocator> <MediaURL>imgs/img00588_add3.jpg</MediaURL> </InstanceLocator>
      </MediaInstance> </MediaProfile> </MediaInformation>
    </GenericDS>
  </ClusterNode>
</ClusterRelationship>
</CollectionCluster>

<!-- ################################################ -->
<!-- Description of another class in the -->
<!-- first decomposition of the collection -->
<!-- ################################################ -->


<CollectionCluster id="CC_0_1"> <Annotation> <TextAnnotation>rock and sky</TextAnnotation> </Annotation> <ClusterAttribute xsi:type="ClusterCreation"> <Method mode="Manual"/> </ClusterAttribute> <ClusterAttribute xsi:type="RepresentativeIcons"> <MediaLocator> <MediaURL>icons/1.jpg</MediaURL> </MediaLocator> </ClusterAttribute> <ClusterAttribute xsi:type="NumberElements"> value="6"/></ClusterAttribute> <ClusterAttribute xsi:type="ClusterStatistics" DescriptorName="ColorHistogram"> <!-- Cluster statistics go here --> </ClusterAttribute> <ClusterRelationship xsi:type="ClusterDecomposition" …> <!—- Class decomposition in images goes here --> </ClusterRelationship> </CollectionCluster>

<!-- ################################################ -->
<!-- End of description of first classification of -->
<!-- the images in the collection -->
<!-- ################################################ -->

</ClusterRelationship>

<!-- ################################################ -->
<!-- Second decomposition of root collection cluster -->
<!-- It describes the new classification of the -->
<!-- images in the collection by the participants -->
<!-- in the Collection DS CE -->
<!-- ################################################ -->

<ClusterRelationship xsi:type="ClusterDecomposition" type="Structure Topological" name="Cluster Decomposition" degree="2" DecompositionType="Semantic" overlaps="false" gaps="false">

<!-- ################################################ -->
<!-- Description of the classes generated in the -->
<!-- second classification of the image collection -->
<!-- ################################################ -->

<CollectionCluster id="CC_0_50"> <Annotation> <TextAnnotation>Beauty</TextAnnotation> </Annotation> <ClusterAttribute xsi:type="ClusterCreation"> <Method mode="Manual"/> </ClusterAttribute> <ClusterAttribute xsi:type="RepresentativeIcons"> <MediaLocator> <MediaURL>icons/0.jpg</MediaURL> </MediaLocator> </ClusterAttribute> <ClusterAttribute xsi:type="NumberElements"> value="26"/></ClusterAttribute>


  <ClusterAttribute xsi:type="ClusterStatistics" DescriptorName="ColorHistogram">
    <!-- Cluster statistics go here -->
  </ClusterAttribute>
  <ClusterRelationship xsi:type="ClusterDecomposition" ...>
    <!-- Class decomposition in images goes here -->
  </ClusterRelationship>
</CollectionCluster>

<!-- ################################################ -->
<!-- End of description of second classification of -->
<!-- the images in the collection -->
<!-- ################################################ -->

</ClusterRelationship> </CollectionCluster>

<!-- ################################################ -->
<!-- End of image collection description -->
<!-- ################################################ -->

</CollectionStructure>

13.1.1.5 Collection Structure Use

Based on an image collection description as given in the two previous sections, an image collection browsing and searching application could be developed providing the following functionality:

Browse the images in the collection
Browse the classes in the collection
Browse the images belonging to a class
Browse the classes assigned to each image
Retrieve classes similar to a query class in terms of mean color histogram

In the class browsing system, the user could view the classes as a simple flat list even though these include classes from two independent classifications of the images in the collection. Another way to show the classes to users would be in a hierarchical fashion. For each class, a semantic label could be displayed together with the representative icon of the class. Users could view the images that belong to each class by clicking with the mouse on the name or the representative icon of the class.

In the image browsing system, users could view the classes to which an image belongs by clicking with the mouse on the icon of the image. Again, by selecting a class, users could browse the images that belong to that class.

When browsing through the collection classes, a text "ch" right below the class representative icon could start a class query based on the mean color histogram. Classes similar to the selected one based on the mean color histogram would be returned. Euclidean distance or any other similarity matching function could be used to compare the mean color histograms of two classes (see Visual XM document).
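A non-normative sketch of such a class query using Euclidean distance is given below; the dictionary-based interface and function name are assumptions of the sketch:

import numpy as np

def rank_similar_classes(query_mean, class_means):
    """Rank classes by Euclidean distance between mean color histograms (closest first)."""
    dist = {name: float(np.linalg.norm(np.asarray(m) - np.asarray(query_mean)))
            for name, m in class_means.items()}
    return sorted(dist, key=dist.get)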

13.2 Models

13.2.1 Model DS

The Model DS is used for analysis and classification of audio-visual data.

13.2.1.1 Description Scheme Syntax

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.2.1.2 Description Scheme Semantics

The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

Page 188: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

13.2.1.3 Description ExtractionN.A.

13.2.1.4 Description ExampleN.A.

13.2.1.5 Description UseN.A.

13.3 Probability Models

13.3.1 ProbabilityModel DS
The ProbabilityModel DS is used to specify statistical functions and probabilistic structures. The ProbabilityModel DS can be used for representing samples of audio-visual data and classes of descriptors using statistical approximation.

13.3.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.3.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.3.1.3 Description Extraction
N.A.

13.3.1.4 Description Example
State-transition models can be used to model the evolution of events along the temporal dimension. For example, a temporal sequence of scenes in video can be characterized by a state-transition model that describes the probabilities of transitions between scenes.

<StateTransitionModel> <Transitions size1="20" size2="20"> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.210526 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0526316 0.157895 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0526316 0.210526 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0526316 0 0 0 0 0 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
</Transitions>
<Initial size="20"> 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 </Initial>
<State label="0 players" confidence="1"/>
<State label="1 player" confidence="0.707107"/>
<State label="2 players" confidence="0.57735"/>
<State label="3 players" confidence="0.5"/>
<State label="4 players" confidence="0.447214"/>
<State label="5 players" confidence="0.408248"/>
<State label="6 players" confidence="0.377964"/>
<State label="7 players" confidence="0.353553"/>
<State label="8 players" confidence="0.333333"/>
<State label="9 players" confidence="0.316228"/>
<State label="10 players" confidence="0.301511"/>
<State label="11 players" confidence="0.288675"/>
<State label="12 players" confidence="0.27735"/>
<State label="13 players" confidence="0.267261"/>
<State label="14 players" confidence="0.258199"/>
<State label="15 players" confidence="0.25"/>
<State label="16 players" confidence="0.242536"/>
<State label="17 players" confidence="0.235702"/>
<State label="18 players" confidence="0.229416"/>
<State label="19 players" confidence="0.223607"/>
</StateTransitionModel>

13.3.1.5 Description Use
Matching metrics: four different state-transition model matching metrics have been investigated. They are briefly described in the list below; a minimal implementation of the first is sketched after the list.

- Euclidean distance: calculates the sum of squared differences between transition probabilities.
- Quadratic distance: calculates the sum of weighted quadratic distances between transition probabilities.
- Weighted transition frequency: calculates the weighted sum of ratios of transition probabilities.
- Euclidean distance of aggregated state transitions: calculates the sum of squared differences of aggregated transitions.
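The following C++ sketch shows the first metric only; it is illustrative and non-normative, assuming two transition matrices of identical dimensions stored in row-major order (the function name transitionDistance is hypothetical).

#include <cstddef>
#include <vector>

// Hypothetical sketch of the Euclidean matching metric between two
// state-transition models: sum of squared differences between the
// corresponding transition probabilities. Both matrices are assumed
// to be row-major and of identical size (e.g. 20x20 as above).
double transitionDistance(const std::vector<double>& transA,
                          const std::vector<double>& transB)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < transA.size(); ++i) {
        const double d = transA[i] - transB[i];
        sum += d * d;
    }
    return sum; // sum of squared differences, as described above
}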


13.3.2 Gaussian DS
The Gaussian DS is used to specify a multi-dimensional probability distribution in terms of low-order Gaussian approximation. The Gaussian DS includes statistical measures of multi-dimensional centroid and variance represented by either a variance vector or covariance matrix.

13.3.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.3.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.3.2.3 Description Extraction

The Gaussian DS can be computed from a sample set of vectors or descriptors. Given $N$ sample vectors $x_1, \ldots, x_N$, the mean of the Gaussian is computed from $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$, and the variance of dimension $j$ from $\sigma_j^2 = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \mu_j)^2$.
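A minimal, non-normative extraction sketch in C++ follows, assuming a non-empty sample set in which all descriptors have the same dimension; the function name extractGaussian is hypothetical.

#include <cstddef>
#include <vector>

// Hypothetical sketch of Gaussian DS extraction: compute the mean and
// variance vectors over N sample descriptors of dimension D (e.g. the
// 166-bin color histograms of the example below).
void extractGaussian(const std::vector<std::vector<double>>& samples,
                     std::vector<double>& mean,
                     std::vector<double>& variance)
{
    const std::size_t n = samples.size();        // assumed > 0
    const std::size_t d = samples.front().size();
    mean.assign(d, 0.0);
    variance.assign(d, 0.0);
    for (const auto& x : samples)                // sum of samples
        for (std::size_t j = 0; j < d; ++j)
            mean[j] += x[j];
    for (std::size_t j = 0; j < d; ++j)
        mean[j] /= static_cast<double>(n);
    for (const auto& x : samples)                // sum of squared deviations
        for (std::size_t j = 0; j < d; ++j) {
            const double dev = x[j] - mean[j];
            variance[j] += dev * dev;
        }
    for (std::size_t j = 0; j < d; ++j)
        variance[j] /= static_cast<double>(n);   // population variance
}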

13.3.2.4 Description Example
The following Gaussian DS description example gives the mean and variance of eleven 166-dimensional color histograms. In this example, both the mean and variance are given as 166-dimensional vectors.

<Gaussian> <Mean> 4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09 7075 2428.64 42.9091 0 0 0 0 0 0 0 0 0 1 42.7273 27.3636 196 190.364 50.4545 8555.55 3645.36 165 0 0 0 0 0 0 0 0 0 0 23.5455 7.09091 3.63636 3.81818 0.272727 4.72727 2 0.363636 0 0 0 0 0 0 0 0 0 11102 1 1960.55 6776.27 2595.36 2684.55 681.727 1282.36 3228.91 3111.55 3.36364 0 0 0 0 0 0 0 0 0 2157.09 6249.36 46 0 0 13 10.9091 4204.64 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1601 0 1922.45 5089.18 33.2727 4 0.0909091 5.18182 97.4546 355.364 1.54545 0 0 0 0 0 0 0 0 0 1206.64 1527.45 0 0 0 0 0 118.818 1.27273 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Mean> <Variance> 1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006 6.0651e+006 7.04084e+006 6.8031e+006 5.8579e+007 7.53956e+006 2184.36 0 0 0 0 0 0 0 0 0 10 2014.36 846 42045.6 37484.9 3788 7.51059e+007 1.40304e+007 30853.6 0 0 0 0 0 0 0 0 0 0 653.273 58.9091 20.3636 32.3636 0 29.6364 5.09091 0 0 0 0 0 0 0 0 0 0 1.24205e+008 2.90909 3.93814e+006 4.63218e+007 6.85744e+006 7.56771e+006 506779 1.7974e+006 1.11572e+007 1.00054e+007 22.1818 0 0 0 0 0 0 0 0 0 4.93268e+006 3.96097e+007 3098.91 0 0 605.818 329.818 1.80207e+007 6789.27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.91941e+006 0 3.73126e+006 2.60184e+007 1282.55 35.4545 0 32.9091 11217.8 131619 12 0 0 0 0 0 0 0 0 0 1.48855e+006 2.59591e+006 0 0 0 0 0 38052.5 3.45455 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Variance></Gaussian>

13.3.2.5 Description Use
The Gaussian DS can be used to provide a compact statistical approximation of a class of feature vectors or descriptors. The Gaussian DS can also be used within a ProbabilityModelClass DS to represent the class in terms of a Gaussian probability model, semantic label and confidence.
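When the variance is given as a vector, the implied likelihood of a descriptor $\mathbf{x}$ under the class model is, assuming independent dimensions, the standard diagonal-covariance Gaussian density (a standard result, not quoted from the MDS WD):

$$p(\mathbf{x}) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$

where $\mu_j$ and $\sigma_j^2$ are the $j$-th components of the Mean and Variance elements. A log-likelihood of this form can be used by probability-model classifiers (see 13.5.4).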


13.4 Analytic Models
The AnalyticModel DS is used to specify a sample set of audio-visual data, a class of descriptors, a group of classes, or a cluster of descriptors. The AnalyticModel DS gives a semantic label represented by a string, which is used by the derived DSs to give a semantic label to each model. The AnalyticModel DS also optionally indicates the confidence with which the semantic label is assigned to the model.

13.4.1 AnalyticModel DS

13.4.1.1 Description Scheme Syntax
A number of normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction). The State DS is currently undergoing XM software integration.

13.4.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.1.3 Description Extraction
N.A.

13.4.1.4 Description Example
N.A.

13.4.1.5 Description Use
N.A.

13.4.2 Cluster DS
The Cluster DS is used to specify a group of audio-visual data. In general, the elements of the cluster can be audio samples, images, regions, segments, video programs, and so forth. The Cluster DS allows the assignment of a semantic label to the cluster.

13.4.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.2.3 Description Extraction
N.A.

13.4.2.4 Description Example
The following Cluster DS description example gives a cluster of images that belong to a semantic class of "fish" images. In this example, the Cluster has eleven JPEG images.

<Cluster Length="11" SemanticLabel="fish" Confidence="0.9"> <MediaLocator>i0323_add5.jpg</MediaLocator> <MediaLocator>i0324_add5.jpg</MediaLocator> <MediaLocator>i0325_add5.jpg</MediaLocator> <MediaLocator>i0326_add5.jpg</MediaLocator> <MediaLocator>i0327_add5.jpg</MediaLocator> <MediaLocator>i0328_add5.jpg</MediaLocator> <MediaLocator>i0329_add5.jpg</MediaLocator> <MediaLocator>i0330_add5.jpg</MediaLocator> <MediaLocator>i0331_add5.jpg</MediaLocator> <MediaLocator>i0332_add5.jpg</MediaLocator> <MediaLocator>i0333_add5.jpg</MediaLocator></Cluster>


13.4.2.5 Description Use
The Cluster DS can be used to group audio-visual data items and assign them a single semantic label. This is useful in representing a semantic class of audio-visual data. The clusters can also be used within a ClusterSet DS to form a classifier based on the semantic classes.

13.4.3 Examples DS
The Examples DS describes a group of descriptors. The Examples DS allows the assignment of a semantic label to the group.

13.4.3.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.3.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.3.3 Description Extraction
N.A.

13.4.3.4 Description Example
The following Examples DS description gives a group of descriptors that belong to a semantic class of "baldheaded man walking and talking with persons" images. This example gives a group of three color histogram descriptors, which are assigned a single semantic label.

<Examples SemanticLabel="baldheaded man walking" Length="3" Confidence="1.0" DescriptorName="ColorHistogram"> <Descriptor> 4617 11986 938 2628 458 1463 5178 2258 444 134 69 456 9300 2810 121 21 14 18 48 107 277 53 47 1926 8281 793 38 11 0 5 201 28 0 1 1 2 23 252 122 6 3 433 1517 46 1 1 0 0 0 0 0 0 0 0 2 55 13560 3326 678 221 1610 5602 916 32 8 1 21 58 11 1 0 0 2 61 331 179 14 7 2388 6213 51 0 0 0 0 0 0 0 0 0 0 2 337 243 0 0 220 194 0 0 0 0 0 0 0 0 0 0 0 0 383 3172 1072 51 20 91 128 0 0 0 0 0 2 4 0 0 0 0 89 757 694 0 0 217 39 0 0 0 0 0 0 0 0 0 0 0 0 912 210 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 55 </Descriptor>

<Descriptor> 1764 18807 725 816 553 1784 7133 1325 81 3 8 110 5621 2323 34 11 0 3 12 82 156 26 11 700 3060 63 7 0 0 0 1 0 0 1 0 0 16 95 40 4 0 16 20 1 0 0 0 0 0 0 0 0 0 0 0 17 13534 3211 523 126 1123 5181 347 37 0 0 0 5 8 2 1 0 2 17 261 168 3 0 997 2635 3 0 0 0 0 0 0 0 0 0 0 2 292 39 0 0 17 1 0 0 0 0 0 0 0 0 0 0 0 0 157 861 430 3 0 26 14 0 0 0 0 0 0 0 0 0 0 0 21 608 215 0 0 81 1 0 0 0 0 0 0 0 0 0 0 0 0 373 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9

</Descriptor> <Descriptor> 9742 15760 1455 2216 475 1356 4771 2328 714 329 193 420 6954 6087 298 15 15 22 35 119 74 115 24 1253 7629 352 14 5 1 3 85 99 0 0 0 0 0 11 0 6 0 335 717 9 0 0 0 0 0 0 0 0 0 0 0 0 12332 3066 991 157 1048 4836 469 14 1 0 0 160 80 4 0 0 0 13 217 101 53 0 3450 6079 12 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 338 64 0 0 0 0 0 0 0 0 0 0 0 0 0 2439 718 15 0 81 41 0 0 0 0 0 0 0 0 0 0 0 0 65 0 0 0 447 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor></Examples>


13.4.3.5 Description Use
The Examples DS can be used to group descriptors and assign them a semantic label. This is useful in representing a feature class or semantic class of descriptors. The Examples DS can also be used within an ExamplesSet DS to form a classifier based on the semantic classes.

13.4.4 ProbabilityModelClass DS
The ProbabilityModelClass DS is used to specify a class of descriptors in terms of a probability model. The confidence value associated with the probability model class, which indicates the confidence of the assignment of the semantic label, may be based on the sample size for the probability model.

13.4.4.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.4.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.4.4.3 Description Extraction
N.A.

13.4.4.4 Description Example
The following ProbabilityModelClass DS description example gives a Gaussian representation of the class with label "fish" in terms of the mean and variance of the color histograms in the class.

<ProbabilityModelClass SemanticLabel="fish" Confidence="0.5" DescriptorName="ColorHistogram"> <Gaussian> <Mean> 4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09 7075 2428.64 42.9091 0 0 0 0 0 0 0 0 0 1 42.7273 27.3636 196 190.364 50.4545 8555.55 3645.36 165 0 0 0 0 0 0 0 0 0 0 23.5455 7.09091 3.63636 3.81818 0.272727 4.72727 2 0.363636 0 0 0 0 0 0 0 0 0 11102 1 1960.55 6776.27 2595.36 2684.55 681.727 1282.36 3228.91 3111.55 3.36364 0 0 0 0 0 0 0 0 0 2157.09 6249.36 46 0 0 13 10.9091 4204.64 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1601 0 1922.45 5089.18 33.2727 4 0.0909091 5.18182 97.4546 355.364 1.54545 0 0 0 0 0 0 0 0 0 1206.64 1527.45 0 0 0 0 0 118.818 1.27273 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

</Mean> <Variance> 1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006 6.0651e+006 7.04084e+006 6.8031e+006 5.8579e+007 7.53956e+006 2184.36 0 0 0 0 0 0 0 0 0 10 2014.36 846 42045.6 37484.9 3788 7.51059e+007 1.40304e+007 30853.6 0 0 0 0 0 0 0 0 0 0 653.273 58.9091 20.3636 32.3636 0 29.6364 5.09091 0 0 0 0 0 0 0 0 0 0 1.24205e+008 2.90909 3.93814e+006 4.63218e+007 6.85744e+006 7.56771e+006 506779 1.7974e+006 1.11572e+007 1.00054e+007 22.1818 0 0 0 0 0 0 0 0 0 4.93268e+006 3.96097e+007 3098.91 0 0 605.818 329.818 1.80207e+007 6789.27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.91941e+006 0 3.73126e+006 2.60184e+007 1282.55 35.4545 0 32.9091 11217.8 131619 12 0 0 0 0 0 0 0 0 0 1.48855e+006 2.59591e+006 0 0 0 0 0 38052.5 3.45455 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Variance> </Gaussian></ProbabilityModelClass>


13.4.4.5 Description Use
The ProbabilityModelClass DS can be used to represent a class of descriptors with a single semantic label in terms of a probability model. This is useful in statistically representing a feature class or semantic class of descriptors. The ProbabilityModelClass DS can also be used to form a classifier based on the semantic classes.

13.5 Classifiers

13.5.1 Classifier DS
The Classifier DS provides a way to describe the different types of classifiers that are used to assign semantic labels to audio-visual data.

13.5.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.1.3 Description Extraction
N.A.

13.5.1.4 Description Example
N.A.

13.5.1.5 Description Use
N.A.

13.5.2 ClusterSet DS
The ClusterSet DS describes a set of clusters that each represent a different semantic concept. This allows the ClusterSet DS to be used for purposes of classification, whereby audio-visual data is assigned semantic labels based on the cluster sets.

13.5.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.2.3 Description Extraction
N.A.

13.5.2.4 Description Example
The following ClusterSet DS description example gives two clusters with labels "Flower Garden" and "rock and sky", respectively. Each cluster is represented using a Cluster, which gives the images that belong to the respective cluster. The ClusterSet has a confidence value that indicates how well the cluster set can be used for classifying audio-visual data.

<ClusterSet Confidence="0.9" Length="2"><Cluster SemanticLabel="Flower Garden" length="4">

<MediaLocator>img00587_add3.jpg</MediaLocator> <MediaLocator>img00585_add3.jpg</MediaLocator> <MediaLocator>img00586_add3.jpg</MediaLocator> <MediaLocator>img00588_add3.jpg</MediaLocator>

</Cluster>


<Cluster SemanticLabel="rock and sky" length="6"> <MediaLocator>img0066d_s1.jpg</MediaLocator> <MediaLocator>img0063d_s1.jpg</MediaLocator> <MediaLocator>img0064d_s1.jpg</MediaLocator> <MediaLocator>img0065d_s1.jpg</MediaLocator> <MediaLocator>img0067d_s1.jpg</MediaLocator> <MediaLocator>img0054d_s1.jpg</MediaLocator>

</Cluster></ClusterSet>

13.5.2.5 Description Use
The ClusterSet DS can be used to represent sets of audio-visual data using clusters of samples. This allows the ClusterSet DS to be used for classifying audio-visual data and assigning semantic labels.

13.5.3 ExamplesSet DS
The ExamplesSet DS describes a set of example groups, where each example group represents a different semantic concept. This allows the ExamplesSet DS to be used for purposes of classification, whereby audio-visual data is assigned semantic labels based on descriptor values.

13.5.3.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.3.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.3.3 Description Extraction
N.A.

13.5.3.4 Description Example
The following ExamplesSet DS description example gives two groups of examples with labels "Flower Garden" and "rock and sky", respectively. Each group is represented using an Examples element, which gives the descriptors that belong to the respective group. The examples set has a confidence value that indicates how well the example set can be used for classifying audio-visual data.

<ExamplesSet confidence="0.9" Length="2"><Examples SemanticLabel="Flower Garden" DescriptorName="ColorHistogram" length="4">

<Descriptor> 18458 5456 318 677 4056 1779 1013 90 2 1 0 6 413 679 12 3 2 2 14 167 1975 449 9375 1689 87 0 0 0 0 0 22 33 0 0 0 0 8 1329 2818 123 2689 390 1 0 0 0 0 0 0 0 0 0 0 0 2 866 3388 1280 569 1034 47 0 0 0 0 0 0 2 37 0 0 0 0 131 1855 2768 123 473 0 0 0 0 0 0 0 0 0 0 0 0 0 53 3283 969 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 158 3391 357 75 15 0 0 0 0 0 0 0 0 0 0 0 0 0 258 1182 173 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 170 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 18848 5341 616 710 1700 1490 658 115 51 39 47 405 1976 481 98 28 18 18 60 299 2559 535 2489 1828 36 0 0 0 0 39 2242 126 0 0 0 0 14 2016 2581 93 192 216 0 0 0 0 0 0 363 80 0 0 0 0 5 1152 2913 2067 755 655 70 9 0 0 0 0 0 574 176 5 1 0 1 135 1856 4327 97 264 6 0 0 0 0 0 0 634 10 0 0 0 0 74 2689 722 0 0 0 0 0 0 0 0 0 12 1 0 0 0 0 0 83 2028 765 129 45 1 0 0 0 0 0 141 2888 0 0 0 0 0 122 1229 285 0 0 0 0 0 0 0 0 0 1355 0 0 0 0 0 1 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor>


27012 5751 431 1357 3030 1461 991 85 9 7 3 32 752 608 55 14 9 5 51 303 879 149 6682 3069 109 0 0 0 0 1 1090 305 3 3 0 0 32 1361 597 7 1523 756 0 0 0 0 0 0 306 147 1 0 0 0 2 552 1683 449 689 851 1 0 0 0 0 0 0 281 85 3 4 8 1 28 970 356 425 1557 0 0 0 0 0 0 0 683 44 0 0 0 0 2 971 29 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 20 936 13 111 13 0 0 0 0 0 0 0 667 3 0 0 0 0 5 115 0 67 7 0 0 0 0 0 0 0 6170 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 23957 3128 524 1225 6871 2516 3243 325 28 16 9 13 179 716 3 1 1 2 14 192 1653 479 12859 871 196 1 0 0 0 1 58 856 0 0 0 0 11 803 1221 42 3516 13 0 0 0 0 0 0 12 12 0 0 0 0 1 238 2144 943 238 362 1 0 0 0 0 0 0 15 2 0 0 0 0 100 1138 1174 1 478 0 0 0 0 0 0 0 10 3 0 0 0 0 4 742 48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3134 98 4 14 0 0 0 0 0 0 0 0 0 0 0 0 0 35 279 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> </Examples>

<Examples SemanticLabel="rock and sky" DescriptorName="ColorHistogram" length="6"> <Descriptor> 26 8103 6380 4909 2 0 0 0 0 0 0 0 3044 648 0 0 0 0 0 7 33 80 0 0 0 0 0 0 0 0 24153 51 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17059 4686 20070 4 0 0 0 0 0 0 0 3 0 0 0 0 0 0 7 0 4 0 0 0 0 0 0 0 0 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3316 482 5130 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 117 18964 1017 2768 29 0 0 0 0 0 0 0 13223 1787 0 0 0 0 0 0 8 32 0 0 0 0 0 0 0 0 24186 1186 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8014 4309 18033 0 0 0 0 0 0 0 0 1778 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 358 679 1788 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

</Descriptor> <Descriptor> 14 10755 6889 4398 8 0 0 0 0 0 0 0 6670 460 0 0 0 0 0 3 19 45 0 0 0 0 0 0 0 0 16713 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14357 3858 20265 0 0 0 0 0 0 0 0 7946 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1029 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3177 187 1481 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 30 6478 3374 2756 3 0 0 0 0 0 0 0 1849 313 1 0 0 0 0 5 39 101 0 0 0 0 0 0 0 0 22605 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13746 4198 19434 2 0 0 0 0 0 0 0 3474 23 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 7025 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4263 648 7931 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 0 2299 7686 8177 0 0 0 0 0 0 0 0 22523 705 0 0 0 0 0 3 17 60 0 0 0 0 0 0 0 0 33800 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7483 2454 8978 0 0 0 0 0 0 0 0 400 0 0


0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 188 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1868 295 1368 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> <Descriptor> 5 13396 3142 1872 68 0 0 0 0 0 0 0 34390 12732 0 0 0 0 0 3 11 14 0 0 0 0 0 0 0 0 616 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12213 2982 6420 0 0 0 0 0 0 0 0 8165 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1649 40 553 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Descriptor> </Examples></ExamplesSet>

13.5.3.5 Description Use
The ExamplesSet DS can be used to represent a set of feature classes using sets of descriptor examples. This allows the ExamplesSet DS to be used for classifying descriptors and assigning semantic labels based on descriptor values.
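One simple, non-normative way to classify with an ExamplesSet is nearest-neighbour labelling, sketched below in C++. The type and function names are hypothetical, and the squared Euclidean distance is an assumption; any of the matching metrics of the Visual XM could be substituted.

#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical labelled example: one descriptor together with the
// semantic label of the Examples group it belongs to.
struct LabelledDescriptor {
    std::string semanticLabel;
    std::vector<double> values;
};

// Nearest-neighbour classification against an ExamplesSet: return the
// label of the example descriptor closest to the query (squared
// Euclidean distance). Illustrative only; matching is non-normative.
std::string classifyByExamples(const std::vector<double>& query,
                               const std::vector<LabelledDescriptor>& examples)
{
    double best = std::numeric_limits<double>::max();
    std::string label;
    for (const auto& ex : examples) {
        double sum = 0.0;
        for (std::size_t j = 0; j < query.size(); ++j) {
            const double d = query[j] - ex.values[j];
            sum += d * d;
        }
        if (sum < best) { best = sum; label = ex.semanticLabel; }
    }
    return label;
}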

13.5.4 ProbabilityModelClassifier DS

The ProbabilityModelClassifier DS describes a set of classes in terms of probability models for representing different semantic concepts. This allows the ProbabilityModelClassifier DS to be used for classifying and assigning semantic labels to audio-visual data based on the classes.

13.5.4.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.4.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

13.5.4.3 Description Extraction
N.A.

13.5.4.4 Description Example
The following ProbabilityModelClassifier DS description example gives two classes with labels "fish" and "Flower Garden", respectively. Each class is represented using a ProbabilityModelClass DS that gives a Gaussian representation in terms of the mean and variance of the color histograms in the class.

<ProbabilityModelClassifier confidence="0.9" length="2"> <ProbabilityModelClass SemanticLabel="fish" Confidence="0.5" DescriptorName="ColorHistogram"> <Gaussian> <Mean> 4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09 7075 2428.64 42.9091 0 0 0 0 0 0 0 0 0 1 42.7273 27.3636 196 190.364 50.4545 8555.55 3645.36 165 0 0 0 0 0 0 0 0 0 0 23.5455 7.09091 3.63636 3.81818 0.272727 4.72727 2 0.363636 0 0 0 0 0 0 0 0 0 11102 1 1960.55 6776.27 2595.36 2684.55 681.727 1282.36 3228.91 3111.55 3.36364 0 0 0 0 0 0 0 0 0 2157.09 6249.36 46 0 0 13 10.9091 4204.64 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1601 0 1922.45 5089.18 33.2727 4 0.0909091 5.18182 97.4546 355.364 1.54545 0 0 0 0 0 0 0 0 0 1206.64 1527.45 0 0 0 0 0 118.818 1.27273 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


</Mean> <Variance> 1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006 6.0651e+006 7.04084e+006 6.8031e+006 5.8579e+007 7.53956e+006 2184.36 0 0 0 0 0 0 0 0 0 10 2014.36 846 42045.6 37484.9 3788 7.51059e+007 1.40304e+007 30853.6 0 0 0 0 0 0 0 0 0 0 653.273 58.9091 20.3636 32.3636 0 29.6364 5.09091 0 0 0 0 0 0 0 0 0 0 1.24205e+008 2.90909 3.93814e+006 4.63218e+007 6.85744e+006 7.56771e+006 506779 1.7974e+006 1.11572e+007 1.00054e+007 22.1818 0 0 0 0 0 0 0 0 0 4.93268e+006 3.96097e+007 3098.91 0 0 605.818 329.818 1.80207e+007 6789.27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.91941e+006 0 3.73126e+006 2.60184e+007 1282.55 35.4545 0 32.9091 11217.8 131619 12 0 0 0 0 0 0 0 0 0 1.48855e+006 2.59591e+006 0 0 0 0 0 38052.5 3.45455 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Variance> </Gaussian> </ProbabilityModelClass>

<ProbabilityModelClass SemanticLabel="Flower Garden" Confidence="0.75" DescriptorName="ColorHistogram"> <Gaussian> <Mean> 18653 5398.5 467 693.5 2878 1634.5 835.5 102.5 26.5 20 23.5 205.5 1194.5 580 55 15.5 10 10 37 233 2267 492 5932 1758.5 61.5 0 0 0 0 19.5 1132 79.5 0 0 0 0 11 1672.5 2699.5 108 1440.5 303 0.5 0 0 0 0 0 181.5 40 0 0 0 0 3.5 1009 3150.5 1673.5 662 844.5 58.5 4.5 0 0 0 0 0 288 106.5 2.5 0.5 0 0.5 133 1855.5 3547.5 110 368.5 3 0 0 0 0 0 0 317 5 0 0 0 0 63.5 2986 845.5 0 0 0 0 0 0 0 0 0 6 0.5 0 0 0 0 0 120.5 2709.5 561 102 30 0.5 0 0 0 0 0 70.5 1444 0 0 0 0 0 190 1205.5 229 0 0 0 0 0 0 0 0 0 677.5 0 0 0 0 0 3 140.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Mean> <Variance> 3.47954e+008 2.91417e+007 239823 480521 9.66769e+006 2.69084e+006 728731 10560 1276 741 1081 81825 2.03638e+006 345621 4819 381 154 154 1861 58412 5.22229e+006 243421 4.70369e+007 3.09539e+006 4371 0 0 0 0 741 2.51239e+006 8403 0 0 0 0 119 2.91358e+006 7.29864e+006 11781 3.63235e+006 99075 0 0 0 0 0 0 65703 3160 0 0 0 0 11 1.03752e+006 9.97891e+006 2.95377e+006 446231 748246 3496 36 0 0 0 0 0 164452 16066 10 0 0 0 17560 3.44103e+006 1.31888e+007 12159 146344 15 0 0 0 0 0 0 200661 45 0 0 0 0 4079 9.00142e+006 729277 0 0 0 0 0 0 0 0 0 66 0 0 0 0 0 0 15806 7.80312e+006 355776 11031 1095 0 0 0 0 0 0 9870 4.16883e+006 0 0 0 0 0 40534 1.45258e+006 55348 0 0 0 0 0 0 0 0 0 917335 0 0 0 0 0 10 20470 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </Variance> </Gaussian> </ProbabilityModelClass></ProbabilityModelClassifier>


13.5.4.5 Description Use
The ProbabilityModelClassifier DS can be used to represent a set of feature classes using probability models. This allows the ProbabilityModelClassifier DS to be used for classifying descriptors and assigning semantic labels based on descriptor values.
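A minimal, non-normative C++ sketch of such a classifier follows. It assumes diagonal-covariance Gaussian class models (mean and variance vectors, as in the example above) and picks the class with the highest log-likelihood; the names are hypothetical, and the epsilon guard for zero-variance dimensions is an implementation assumption.

#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical Gaussian class model: semantic label plus the mean and
// variance vectors carried by a ProbabilityModelClass description.
struct GaussianClass {
    std::string semanticLabel;
    std::vector<double> mean;
    std::vector<double> variance;
};

// Classify a descriptor by maximum Gaussian log-likelihood over the
// classes of the classifier (diagonal covariance assumed; eps guards
// zero-variance dimensions). Illustrative sketch only.
std::string classifyByModels(const std::vector<double>& x,
                             const std::vector<GaussianClass>& classes)
{
    const double pi  = 3.14159265358979323846;
    const double eps = 1e-6;
    double best = -std::numeric_limits<double>::max();
    std::string label;
    for (const auto& c : classes) {
        double logL = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            const double var = c.variance[j] + eps;
            const double dev = x[j] - c.mean[j];
            logL -= 0.5 * (std::log(2.0 * pi * var) + dev * dev / var);
        }
        if (logL > best) { best = logL; label = c.semanticLabel; }
    }
    return label;
}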


14 User Interaction

14.1 User Preferences

14.1.1 UserPreference DS
The UserPreference DS is used to describe a user’s preferences pertaining to consumption of multimedia material. User preference descriptions can be correlated with media descriptions to find and consume desired content. Correspondence between user preference descriptions and media descriptions facilitates accurate and efficient personalization of content access and content consumption.

14.1.1.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.1.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.1.3 Description Extraction
Manual instantiation.

14.1.1.4 Description Example
<UserPreference> <UserIdentifier protection="true" userName="Yoon"/> <UsagePreferences allowAutomaticUpdate="false"> <FilteringAndSearchPreferences> . . . </FilteringAndSearchPreferences> <BrowsingPreferences> . . . </BrowsingPreferences> </UsagePreferences></UserPreference>

14.1.1.5 Description Use
User preference descriptions are used by consumers (or their agents) for accessing and consuming multimedia content that fits their personal preferences. A generic usage model is depicted in Figure 45 below, where a user agent takes media descriptions and user preferences as input and generates a filtered output containing descriptions of media that fit personal preferences. In specific applications, the output may be media locators of preferred media, or a summary of an audio-visual program where the type of the summary satisfies the user’s summary preferences. For example, one user may prefer to view only the goals of a soccer match, while another may prefer a 30-minute highlight summary of the match.


Figure 45: A generic usage model for user preference and media descriptions.

14.1.2 UserIdentifier DS
The UserIdentifier DS is used to identify a particular description of user preferences. A single person can use multiple identifiers, each of which identifies a different set of user preferences.

14.1.2.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.2.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.2.3 Description Extraction
Manual instantiation.

14.1.2.4 Description Example
<UserPreference> <UserIdentifier protection="true" userName="Jane"/> <UsagePreferences allowAutomaticUpdate="false"> . . . </UsagePreferences></UserPreference>

14.1.2.5 Description Use
The UserIdentifier DS is used to identify a particular user preference description and distinguish it from other user preference descriptions. It is possible for the same user to have multiple user preference descriptions, each identified by a different instantiation of userName, for use under different usage conditions.

14.1.3 PreferenceType DS
The PreferenceType DS is used to specify a combination of time and/or place that can be associated with a particular set of user preferences. The PreferenceType DS is used to specify a dependency of user preferences on time and location.

14.1.3.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).


14.1.3.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.3.3 Description Extraction
Manual instantiation.

14.1.3.4 Description Example
See sections 14.1.6.4 and 14.1.8.4.

14.1.3.5 Description Use
The PreferenceType DS is used to specify a user’s usage conditions depending on time and location. The PreferenceType description describes a time, a place, or a time and place combination that can be associated with a particular set of browsing or filtering and search preferences. For example, a user may have preferences for different broadcast sports programs during different seasons of the year. Similarly, a user may have a preference for programs in the English language when traveling in Japan.

14.1.4 UsagePreferences DS
The UsagePreferences DS is used to specify a user’s preferences pertaining to filtering, searching and browsing of audio-visual content.

14.1.4.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.4.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.4.3 Description Extraction
Manual instantiation.

14.1.4.4 Description Example
<UserPreference> <UserIdentifier protection="true" userName="John"/> <UsagePreferences allowAutomaticUpdate="false"> <FilteringAndSearchPreferences protection="true"> . . . </FilteringAndSearchPreferences> <BrowsingPreferences protection="true"> . . . </BrowsingPreferences> </UsagePreferences></UserPreference>

14.1.4.5 Description Use
The UsagePreferences DS is used to specify a user’s preferences pertaining to filtering, searching and browsing of audio-visual content. Filtering and search preferences describe, for example, favorite titles, genres and actors. Browsing preferences describe preferred views of favorite programs, where preferences may depend on usage conditions such as the available bandwidth or the time the user has to consume the information. A typical example application is depicted in Figure 46, where usage preferences are used to perform personalized filtering and viewing. In this case, the system filters media descriptions to find media that fit the user’s filtering and search preferences. Preferred media and their descriptions may be stored in local storage. The user then navigates and accesses different views of the media according to the user’s browsing preferences.


Figure 46: Personalized filtering, search and browsing of audio-visual content.

14.1.5 BrowsingPreferences DS
The BrowsingPreferences DS is used to specify user preferences pertaining to navigation of and access to media.

14.1.5.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.5.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.5.3 Description Extraction
Manual instantiation.

14.1.5.4 Description Example
<UserPreference> <UserIdentifier protection="true" userName="Mike"/> <UsagePreferences allowAutomaticUpdate="false"> <BrowsingPreferences protection="true" preferenceValue="8"> <SummaryPreferences> . . . </SummaryPreferences> <BrowsingPreferenceType> . . .

</BrowsingPreferenceType> </BrowsingPreferences> </UsagePreferences></UserPreference>

14.1.5.5 Description Use
The BrowsingPreferences DS is used to specify user preferences pertaining to nonlinear navigation of and access to media, depending on usage conditions, the user’s interests and the time the user is willing to spend to consume the content.


14.1.6 SummaryPreferences DS
The SummaryPreferences DS is used to specify a user’s preferences for visualization and sonification of particular AV content.

14.1.6.1 Description Scheme Syntax
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.6.2 Description Scheme Semantics
The normative components associated with this element are specified in the MDS WD associated with the current version of this document (see Introduction).

14.1.6.3 Description Extraction
Manual instantiation.

14.1.6.4 Description Example
A simple example is as follows.

<SummaryPreferences> <SummaryTypePreference>keyEvents</SummaryTypePreference> <PreferredSummaryTheme>Free-kicks</PreferredSummaryTheme> <PreferredSummaryTheme>Goals</PreferredSummaryTheme></SummaryPreferences>

Another example where a user prefers to visualize summaries based on key-videoclips while in the office, but prefers a visualization based on a limited number of key-frames while on the train or in the car, is as follows.

</UserPreference> <UserIdentifier protection="true" userName="Jimmy"/> <UsagePreferences allowAutomaticUpdate="false"> <BrowsingPreferences protection="true"> <SummaryPreferences> <SummaryTypePreference>keyVideoClips</SummaryTypePreference> <MinSummaryDuration><m>3</m><s>20</s></MinSummaryDuration> <MaxSummaryDuration><m>6</m><s>40</s></MaxSummaryDuration> </SummaryPreferences> <BrowsingPreferenceType> <Place> <PlaceName xml:lang="en">Office</PlaceName> </Place> <Time> <TimePoint> <h>8</h> </TimePoint> <Duration> <No_h>8</No_h> </Duration> </Time> </BrowsingPreferenceType> </BrowsingPreferences>

<BrowsingPreferences protection="true"> <SummaryPreferences> <SummaryTypePreference>keyFrames</SummaryTypePreference> <MaxNumOfKeyframes>50</MaxNumOfKeyframes> </SummaryPreferences> <BrowsingPreferenceType> <Place> <PlaceName xml:lang="en">Train</PlaceName> </Place> </BrowsingPreferenceType>


<BrowsingPreferenceType> <Place> <PlaceName xml:lang="en">Car</PlaceName> </Place> </BrowsingPreferenceType> </BrowsingPreferences> </UsagePreferences></UserPreference>

Editor's Note: The time description should be updated.

14.1.6.5 Description Use
The SummaryPreferences DS is used to specify a user’s preferences for nonlinear navigation of and access to media. Users can specify their preferences for multiple alternative views of a particular audio-visual program that best fit their desires and constraints. For example, a user may prefer a key-frame summary containing a limited number of key-frames over a video highlight summary during a usage mode with a low-bandwidth mobile connection to a media server. Similarly, a user may prefer audio skims of a particular duration as he/she experiences the information in his/her car.

14.1.7 FilteringAndSearchPreferences DS
The FilteringAndSearchPreferences DS specifies users’ filtering or searching preferences in terms of creation and classification aspects of the AV content.

14.1.7.1 Description Scheme Syntax
<!-- ################################################## --><!-- Definition of FilteringAndSearchPreferences DS --><!-- ################################################## -->

<element name="FilteringAndSearchPreferences" type="mds:FilteringAndSearchPreferences"/>

<complexType name="FilteringAndSearchPreferences"><element name="ClassificationPreferences"

type="mds:ClassificationPreferences" minOccurs="0" maxOccurs="unbounded"/>

<element name="CreationPreferences" type="mds:CreationPreferences" minOccurs="0" maxOccurs="unbounded"/> <element name="KeywordPreferences" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:TextualDescription" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType>

</element><element name="SourcePreferences" type="mds:SourcePreferences"

minOccurs="0" maxOccurs="unbounded"/><element name="FilteringAndSearchPreferenceType"

type="mds:PreferenceType" minOccurs="0" maxOccurs="unbounded"/> <element name="FilteringAndSearchPreferences" type="mds:FilteringAndSearchPreferences" minOccurs="0" maxOccurs="unbounded"/>

<attribute name="protection" type="boolean" use="default" value="true"/>

<attribute name="preferenceValue" type="integer" use="optional"/>

</complexType>

Note: The elements other than KeywordPreferences in the above specification are also included in the MDS WD.


Editor's Note: The MediaReview type is not defined.

14.1.7.2 Description Scheme Semantics

FilteringAndSearchPreferences: Describes user’s preferences for filtering of AV content or searching for preferred AV content. A FilteringAndSearchPreferences element may optionally contain other FilteringAndSearchPreferences elements as its children, to specify hierarchically structured preferences. In this case, the filtering and search preferences of the children elements are conditioned on the preferences contained in their parent node.

protection: Describes user’s desire to keep filtering and search preferences private.

preferenceValue: Describes the relative priority or weight assigned to a particular filtering and search preference description, in case multiple filtering and search preference descriptions are present.

ClassificationPreferences: Describes the user’s preference related to media classification descriptions.

CreationPreferences: Describes the user’s preference related to media creation descriptions.

KeywordPreferences: Describes user’s preferred keywords.

SourcePreferences: Describes the user’s preference for a particular source of media.

FilteringAndSearchPreferenceType: Identifies the usage condition(s) for a particular filtering and search preference description.

14.1.7.3 Description Extraction
Manual instantiation.

14.1.7.4 Description Example
The following is a basic example of filtering and search preferences.

<UserPreference> <UserIdentifier protection="true" userName="Kim"/> <UsagePreferences allowAutomaticUpdate="false"> <FilteringAndSearchPreferences protection="true" preferenceValue="100"> <ClassificationPreferences> . . . </ClassificationPreferences> <CreationPreferences> . . . </CreationPreferences> <KeywordPreferences xml:lang="en">artificial vision</KeywordPreferences>

<SourcePreferences> . . .</SourcePreferences>

<FilteringAndSearchPreferenceType> . . . </FilteringAndSearchPreferenceType> </FilteringAndSearchPreferences> </UsagePreferences></UserPreference>

The following is an example of using the recursive structure of the FilteringAndSearchPreferences DS:

<FilteringAndSearchPreferences protection="true" preferenceValue="9"><SourcePreferences>

Page 207: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<PublicationChannel> KBS1 </PublicationChannel></SourcePreferences><FilteringAndSearchPreferences protection="true" preferenceValue="8">

<ClassificationPreferences><Genre> News </Genre>

</ClassificationPreferences><FilteringAndSearchPreferences protection="true"

preferenceValue="7"><CreationPreferences>

<Creator> Tom Brokite </Creator></CreationPreferences>

</FilteringAndSearchPreferences><FilteringAndSearchPreferences protection="true"

preferenceValue="9"><CreationPreferences>

<Creator> J. Pack </Creator></CreationPreferences>

</FilteringAndSearchPreferences></FilteringAndSearchPreferences><FilteringAndSearchPreferences protection="true" preferenceValue="7">

<ClassificationPreferences><Genre> Documentary </Genre>

</ClassificationPreferences></FilteringAndSearchPreferences><FilteringAndSearchPreferences protection="true" preferenceValue="6">

<ClassificationPreferences><Genre> Sports </Genre>

</ClassificationPreferences> <FilteringAndSearchPreferences protection="true" preferenceValue="6">

<ClassificationPreferences><Genre> Soccer </Genre>

</ClassificationPreferences></FilteringAndSearchPreferences>

<FilteringAndSearchPreferences protection="true" preferenceValue="5">

<ClassificationPreferences><Genre> Baseball </Genre>

</ClassificationPreferences></FilteringAndSearchPreferences>

</FilteringAndSearchPreferences></FilteringAndSearchPreferences>

14.1.7.5 Description Use
The FilteringAndSearchPreferences DS is used for automatic content filtering or searching according to user preferences. The filtering and search preferences can be used, for example, to build customized electronic program guides. When combinations of ClassificationPreferences, CreationPreferences, KeywordPreferences and/or SourcePreferences are used, the preferenceValue attribute of FilteringAndSearchPreferences gives the relative preference value of the particular filtering and search preference description as a whole, i.e., of the combined preferences, not of each individual sub-preference. In the above example, the user has specified "artificial vision" as a keyword preference. The user may be following the developments in the area of artificial vision for the blind. A filtering application may match this phrase or its substrings against textual fields of media descriptions, such as the title, creation description, etc. The matching methodology is non-normative and depends on the user’s filtering application.
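As a simple illustration of such non-normative matching, the following C++ sketch selects the programs whose textual fields contain a preferred keyword phrase. The ProgramDescription type and its field names are hypothetical placeholders for the relevant description fields, and plain case-sensitive substring search is an assumption.

#include <string>
#include <vector>

// Hypothetical container for the textual fields of one media
// description that a keyword filter might inspect.
struct ProgramDescription {
    std::string title;
    std::string creationDescription;
};

// Keep the programs whose title or creation description contains the
// preferred keyword phrase. Matching is non-normative; this sketch
// uses plain substring search.
std::vector<ProgramDescription> filterByKeyword(
        const std::vector<ProgramDescription>& programs,
        const std::string& keyword)
{
    std::vector<ProgramDescription> selected;
    for (const auto& p : programs) {
        if (p.title.find(keyword) != std::string::npos ||
            p.creationDescription.find(keyword) != std::string::npos) {
            selected.push_back(p);
        }
    }
    return selected;
}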

14.1.8 ClassificationPreferences DS
The ClassificationPreferences DS is used to specify user preferences related to the classification of the content, e.g., preferred genre, preferred country of origin or preferred language.

14.1.8.1 Description Scheme Syntax

<!-- ################################################ --><!-- Definition of ClassificationPreferences DS --><!-- ################################################ -->

<complexType name="ClassificationPreferences"> <element name="Country" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:Country" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="Language" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:Language" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="Genre" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:ControlledTerm" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="PackagedType" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:ControlledTerm" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="MediaReview" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:MediaReview" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <attribute name="id" type="ID"/> <attribute name="preferenceValue" type="integer" use="optional"/>

</complexType>

Note: The elements other than MediaReview in the above specification are also included in the MDS WD.

14.1.8.2 Description Scheme Semantics

Note: The semantics of MediaReview are defined below. Semantics of all other elements are specified in the MDS WD associated with the current version of this document (see Introduction).

ClassificationPreferences: Specifies user preferences related to the class of the content.

MediaReview: Describes user’s preference with respect to reviews of the content. See Section 8.1.3.

14.1.8.3 Description Extraction
Manual instantiation.

14.1.8.4 Description Example
<UsagePreferences allowAutomaticUpdate="false"> <FilteringAndSearchPreferences protection="true"> <ClassificationPreferences preferenceValue="100"> <Language> <LanguageCode>en</LanguageCode> </Language> <Genre>Movie</Genre>


<MediaReview> <Reviewer> <Individual> <FamilyName>Ebert</FamilyName> <GivenName>Roger</GivenName> </Individual> </Reviewer> <Rating> <RatingCriterion> <CriterionName>Overall</CriterionName> <WorstRating>1</WorstRating> <BestRating>10</BestRating> </RatingCriterion> <RatingValue>10</RatingValue> </Rating> </MediaReview> </ClassificationPreferences>

. . .
<ClassificationPreferences preferenceValue="80">
. . .
</ClassificationPreferences>
. . .

<FilteringAndSearchPreferenceType> <Place> <PlaceName xml:lang="en">Tokyo</PlaceName> <Country>JP</Country> </Place> </FilteringAndSearchPreferenceType> </FilteringAndSearchPreferences></UsagePreferences>

14.1.8.5 Description Use
The ClassificationPreferences DS is used to specify a user’s preferences related to the classification description of the media, such as a preferred language, a favorite genre (e.g., "science-fiction movies", "business news" or "pop-music") or a preferred country of origin (e.g., music from France). In the above example, the user prefers movies in English when he/she is in Japan. In this example, the user also prefers movies reviewed by Roger Ebert and rated at 10, where the rating 10 is the best rating. Multiple instantiations of ClassificationPreferences can be relatively ranked by the application according to the values of their individual preferenceValue attributes.

14.1.9 CreationPreferences DS
The CreationPreferences DS is used to specify user preferences related to the creation of the content, e.g., favorite titles and favorite actors.

14.1.9.1 Description Scheme Syntax
<!-- ################################################ --><!-- Definition of CreationPreferences DS --><!-- ################################################ -->

<complexType name="CreationPreferences"> <element name="Title" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:Title" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="Creator" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:Creator" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="Location" minOccurs="0" maxOccurs="unbounded">

Page 210: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<complexType base="mds:Place" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <element name="DatePeriod" minOccurs="0" maxOccurs="unbounded"> <complexType base="mds:Time" derivedBy="extension"> <attribute name="preferenceValue" type="integer" use="optional"/> </complexType> </element> <attribute name="id" type="ID"/> <attribute name="preferenceValue" type="integer" use="optional"/></complexType>

Note: The elements other than DatePeriod in the above specification are also included in the MDS WD.

14.1.9.2 Description Scheme Semantics
The semantics of DatePeriod are defined below. Semantics of all other elements are specified in the MDS WD associated with the current version of this document (see Introduction).

CreationPreferences: Specifies user preferences related to the creation of the content.

DatePeriod: Describes user’s preference for a period for the creation date/time of the content.

14.1.9.3 Description Extraction
Manual instantiation.

14.1.9.4 Description Example
<UsagePreferences allowAutomaticUpdate="false"> <FilteringAndSearchPreferences protection="true"> <CreationPreferences> <Creator preferenceValue="80"> <Individual> <FamilyName>Ryan</FamilyName> <GivenName>Meg</GivenName> </Individual> <role>actress</role> </Creator> <Creator preferenceValue="90"> <Individual> <FamilyName>Diaz</FamilyName> <GivenName>Cameron</GivenName> </Individual> <role>actress</role> </Creator> <Creator preferenceValue="100"> <Individual> <FamilyName>Ford</FamilyName> <GivenName>Harrison</GivenName></Individual> <role>actor</role> </Creator> <DatePeriod> <TimePoint> <Y>1995</Y> </TimePoint> <Duration> <No_Y>5</No_Y> </Duration> </DatePeriod>


</CreationPreferences> </FilteringAndSearchPreferences></UsagePreferences>

Editor's Note: Time related entities in this example should be updated according to latest Time DSs.

14.1.9.5 Description Use
The CreationPreferences DS is used to specify a user’s preferences related to the creation description of the media, such as a preference for a particular title, a favorite actor, or the period of time within which the content was created. In the above example, the user expresses a preference for content created between the years 1995 and 2000 and featuring preferred actors or actresses, each with a relative preferenceValue attribute.

14.1.10 SourcePreferences DS The SourcePreferences DS is used to specify preferences for the source of the media, such as its medium.

14.1.10.1 Description Scheme SyntaxThe normative components associated with this element are specified in the MDS WD associated with the current version of this document. (see Introduction).

14.1.10.2 Description Scheme SemanticsThe normative components associated with this element are specified in the MDS WD associated with the current version of this document. (see Introduction).

14.1.10.3 Description Extraction
Manual instantiation.

14.1.10.4 Description Example

<UsagePreferences allowAutomaticUpdate="false">
  <FilteringAndSearchPreferences protection="true">
    <CreationPreferences>
      <Title type="original">
        <TitleText xml:lang="en">Star Trek</TitleText>
      </Title>
    </CreationPreferences>
    <SourcePreferences>
      <PublicationType>Terrestrial Broadcast</PublicationType>
      <PublicationDate><Y>2000</Y><M>5</M><D>23</D></PublicationDate>
    </SourcePreferences>
  </FilteringAndSearchPreferences>
</UsagePreferences>

Editor's Note: Time related entities in this example should be updated according to latest Time DSs.

14.1.10.5 Description Use
The SourcePreferences DS is used to specify preferences for the sources of the media, e.g., terrestrial versus satellite broadcast. If the user has access to terrestrial broadcast only, the user may not be interested in program offerings from satellite channels. In the above example, the user has a preference for "Star Trek" programs that are available from terrestrial broadcast on May 23, 2000. Such a preference is useful in creating personalized electronic program guides.
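As a further informative illustration, a user with satellite access could instead express a preference for satellite publications. The following fragment is a minimal sketch; the PublicationType value is an illustrative assumption:

<SourcePreferences>
  <PublicationType>Satellite Broadcast</PublicationType>
</SourcePreferences>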


15 Bibliography

[1] Special issue on Object Based Video Coding and Description, IEEE Transactions on Circuits and Systems for Video Technology, 9(8), December 1999.

[2] L. Agnihotri and N. Dimitrova, "Text Detection for Video Analysis", Workshop on Content Based Image and Video Libraries, held in conjunction with CVPR, Colorado, pp. 109-113, 1999.

[3] Y. Abdeljaoued, T. Ebrahimi, C. Christopoulos and I. Mas Ivars, "A new algorithm for shot boundary detection", Proceedings of the European Signal Processing Conference (EUSIPCO 2000), Special Session on Multimedia Indexing, Browsing and Retrieval, 5-8 September 2000, Tampere, Finland.

[4] M. Bierling, "Displacement Estimation by Hierarchical Block Matching", SPIE Vol. 1001, Visual Communication & Image Processing, 1988.

[5] A. Del Bimbo, E. Vicario and D. Zingoni, "Symbolic description and visual querying of image sequences using spatio-temporal logic", IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 4, August 1995.

[6] N. Björk and C. Christopoulos, "Transcoder Architectures for video coding", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 98), Seattle, Washington, Vol. 5, pp. 2813-2816, May 12-15, 1998.

[7] S.-K. Chang, Q. Y. Shi, and C. Y. Yan, "Iconic indexing by 2-D strings", IEEE Trans. Pattern Analysis Machine Intell., 9(3):413-428, May 1987.

[8] V.N. Gudivada and V.V. Raghavan, "Design and Evaluation of Algorithms for Image Retrieval by Spatial Similarity", ACM Transactions on Information Systems, Vol. 13, No. 2, April 1995, pp. 115-144.

[9] A. Hanjalic and H.J. Zhang, "An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis", IEEE Transactions on Circuits and Systems for Video Technology 9(8): 1280-1289, December 1999.

[10] M. Kass, A. Witkin and D. Terzopoulos, "Snakes: Active contour models", International Journal of Computer Vision, pp 321-331, 1988.

[11] M. Kim, J.G. Choi, D. Kim, H. Lee, M.H. Lee, C. Ahn and Y.S. Ho, "A VOP Generation tool: automatic segmentation of moving objects in image sequences based on spatio-temporal information", IEEE Transactions on Circuits and Systems for Video Technology 9(8): 1216-1226, December 1999.

[12] S. Herrmann, H. Mooshofer, HY. Dietrich and W. Stechele, "A video segmentation algorithm for hierarchical object representation and its implementation", IEEE Transactions on Circuits and Systems for Video Technology 9(8): 1204-1215, December 1999.

[13] A.K. Jain and R.C. Dubes, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, NJ, 1988.

[14] T. Meier and K.N. Ngan, "Video segmentation for content based coding", IEEE Transactions on Circuits and Systems for Video Technology 9(8): 1190-1203, December 1999.

[15] J. Meng, Y. Juan and S.-F. Chang, "Scene Change Detection in a MPEG Compressed Video Sequence", Proceedings, IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology (EI'95) - Digital Video Compression: Algorithms and Technologies, San Jose, February 1995.

[16] A. Perkis, Y. Abdeljaoued, C. Christopoulos, T. Ebrahimi and J. Chicharo, "Universal Multimedia Access from Wired and Wireless Systems", submitted to Circuits, Systems and Signal Processing, Special Issue on Multimedia Communication Services, 2000.

[17] P. Salembier and F. Marqués, "Region-based representation of image and video: Segmentation tools for multimedia services", IEEE Transactions on Circuits and Systems for Video Technology 9(8): 1147-1169, December 1999.

[18] J-C. Shim, C. Dorai, and R. Bolle, "Automatic Text Extraction from Video for Content-Based Annotation and Retrieval," in Proc. of the Int. Conference on Pattern Recognition, pp. 618-620, August 1998.

[19] J.-C. Shim and C. Dorai, "A Fast and Generalized Region Labeling Algorithm," in Proc. of the Int. Conference on Image Processing, October 1999.

[20] H. Wallin, C. Christopoulos, A. Smolic, Y. Abdeljaoued and T. Ebrahimi, "Robust mosaic construction algorithm", ISO/IEC JTC1/SC29/WG11 MPEG00/M5698, Noordwijkerhout, The Netherlands, March 2000.

[21] World Wide Web Consortium (W3C), "Synchronized Multimedia", http://www.w3.org/AudioVideo/

[22] D. Zhong and S.-F. Chang, "AMOS: An Active System For MPEG-4 Video Object Segmentation", 1998 International Conference on Image Processing, October 4-7, 1998, Chicago, Illinois, USA.

[23] D. Zhong and S.-F. Chang, "Video Object Model and Segmentation for Content-Based Video Indexing", ISCAS'97, Hong Kong, June 9-12, 1997.

[24] D. Zhong and S.-F. Chang, "Spatio-Temporal Video Search Using the Object Based Video Representation", ICIP'97, October 26-29, 1997, Santa Barbara, CA.

[25] D. Zhong and S.-F. Chang, "Region Feature Based Similarity Searching of Semantic Video Objects", ICIP'99, October 24-28, 1999, Kobe, Japan.

The following references should be introduced in the text.

References related to Importance Hint and Regions of Interest:

C. Christopoulos, J. Askelof and M. Larsson, "Efficient methods for encoding regions of interest in the upcoming JPEG2000 still image coding standard", IEEE Signal Processing Letters, 2000 (submitted).

C. Christopoulos, J. Askelof and M. Larsson, "Efficient region of interest encoding techniques in the upcoming JPEG2000 still image coding standard", invited paper to IEEE Int. Conference on Image Processing (ICIP), Special Session on JPEG2000, September 10-13, 2000, Vancouver, Canada.

C. Christopoulos, J. Askelof and M. Larsson, "Efficient encoding and reconstruction of Regions of Interest in JPEG2000", accepted to the European Signal Processing Conference (EUSIPCO 2000), 5-8 September 2000, Tampere, Finland.

References related to motion hints:

N. Björk and C. Christopoulos, "Transcoder Architectures for video coding", IEEE Transactions on Consumer Electronics, Vol. 44, No. 1, pp. 88-98, February 1998. (Received the 3rd Place Chester W. Sall Award for the best papers published in IEEE Trans. on Consumer Electronics.)

N. Björk and C. Christopoulos, "Transcoder Architectures for video coding", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 98), Seattle, Washington, Vol. 5, pp. 2813-2816, May 12-15, 1998.


16 Annex 1: Schema definition

In this section, the syntax defined in the current document is extracted so that it can be easily used for XM development and processed by validating parsers.

<schema xmlns="http://www.w3.org/1999/XMLSchema"
        xmlns:mds="http://www.example.com/???"
        targetNamespace="http://www.example.com/???"
        elementFormDefault="unqualified"
        attributeFormDefault="unqualified">

<simpleType name="probabilityType" base="decimal" derivedBy="restriction">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</simpleType>

<simpleType name="confidenceType" base="decimal" derivedBy="restriction">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</simpleType>

<complexType name="ProbabilityVectorType" base="DoubleVector"
             derivedBy="restriction">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</complexType>

<complexType name="ProbabilityMatrixType" base="DoubleMatrix"
             derivedBy="restriction">
  <minInclusive value="0.0"/>
  <maxInclusive value="1.0"/>
</complexType>

<!-- ################################################ -->
<!-- Definition of the MediaURL D                     -->
<!-- ################################################ -->
<simpleType name="MediaURL" base="uri"/>

<!-- #################################################### -->
<!-- Definition of WeightDS and its basic building blocks -->
<!-- #################################################### -->
<complexType name="WeightValue">
  <choice minOccurs="0" maxOccurs="1">
    <element ref="Reference"/>
    <element name="KeyName" type="string"/>
  </choice>
  <attribute name="ReferenceValue" type="float" use="optional"/>
</complexType>

<complexType name="Weight">
  <element name="WeightValue" type="mds:WeightValue" maxOccurs="unbounded"/>
  <attribute name="id" type="ID"/>
  <attribute name="reliability" type="float"/>
  <attribute name="WeightName" type="string"/>
</complexType>
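<!-- Informative sketch (not part of the extracted syntax): a possible Weight
     instance. The WeightName, KeyName values and numeric weights below are
     illustrative assumptions only.
<Weight id="w1" WeightName="FeatureWeights" reliability="0.8">
  <WeightValue ReferenceValue="0.7"><KeyName>Color</KeyName></WeightValue>
  <WeightValue ReferenceValue="0.3"><KeyName>Texture</KeyName></WeightValue>
</Weight>
-->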

<element name="KeyPoint" type="float" maxOccurs="unbounded"/><!-- Description of Interpolation Functions --><sequence minOccurs="0" maxOccurs="unbounded"><!-- This part appears "0" time or "NKeyPoints-1" times --><element name="FunctionID"<simpleType base="Integer"/><minInclusive value="-1"/><maxInclusive value="1"/></simpleType></element><element name="FunctionParameterValue"type="float" minOccurs="0"/></sequence></sequence><attribute name="NKeyPoints" type="positiveInteger" use="required"/><attribute name="Dimension" type="positiveInteger" use="required"/></complexType><!-- ******************************************* --><!-- Definition of abstract "SeriesOfScalarType" --><!-- ******************************************* --><complexType name="SeriesOfScalarType" abstract="true"><attribute name="nElements" type="nonNegativeInteger" use="default"value="1"/></complexType><!-- ************************************* --><!-- Audio Signal Power --><!-- ************************************* --><complexType name="AudioPowerType" base="AudioSampledType"derivedBy="extension"><element name="Value" type="mds:SeriesOfScalarType"maxOccurs="unbounded"/></complexType><!-- *********************************************************** --><!-- Definition of abstract "SOSScaledType" --><!-- *********************************************************** --><complexType name="SOSScaledType" base="mds:SeriesOfScalarType"derivedBy="extension" abstract="true"><element name="Min" type="mds:FloatVectorType" minOccurs="0"/><element name="Max" type="mds:FloatVectorType" minOccurs="0"/><element name="Mean" type="mds:FloatVectorType" minOccurs="0"/><element name="Random" type="mds:FloatVectorType" minOccurs="0"/><element name="First" type="mds:FloatVectorType" minOccurs="0"/><element name="Last" type="mds:FloatVectorType" minOccurs="0"/><element name="Variance" type="mds:FloatVectorType" minOccurs="0"/><element name="Weight" type="mds:FloatVectorType" minOccurs="0"/></complexType><!-- *************************************************************** --><!-- Definition of "SOSFixedScaleType" (uniform power-of-2 scaling) --><!-- *************************************************************** --><complexType name="SOSFixedScaleType" base="mds:SOSScaledType"derivedBy="extension"><element name="VarianceScalewise" type="mds:FloatMatrixType"minOccurs="0"/><attribute name="RootFirst" type="boolean" use="default"value="false"/><attribute name="ScaleRatio" type="positiveInteger" use="required"/><attribute name="TotalSamples" type="positiveInteger"use="required"/></complexType><!-- *************************************************************** --><!-- Definition of "SOSFreeScaleType" (variable ratio scaling) --><!-- *************************************************************** --><complexType name="SOSFreeScaleType" base="mds:SOSScaledType" derivedBy="extension">

Page 216: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But
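<!-- Informative sketch (not part of the extracted syntax): a possible
     one-dimensional TemporalInterpolation instance with two key points and a
     single interpolation function between them. The time value is elided and
     all values are illustrative assumptions.
<TemporalInterpolation NKeyPoints="2" Dimension="1">
  <WholeInterval>...</WholeInterval>
  <KeyPoint>0.0</KeyPoint>
  <KeyPoint>10.0</KeyPoint>
  <FunctionID>1</FunctionID>
</TemporalInterpolation>
-->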

<sequence minOccurs="1" maxOccurs="unbounded"><element name="scaleRatio" type="positiveInteger"/><element name="nElements" type="positiveInteger"/></sequence></complexType><!-- *************************************************************** --><!-- Definition of "SOSUnscaledType" --><!-- *************************************************************** --><complexType name="SOSUnscaledType" base="mds:SeriesOfScalarType"derivedBy="extension"><element name="Data" type="mds:FloatVectorType" minOccurs="1"/><element name="Weight" type="mds:floatVectorType" minOccurs="0"/><!-- specify how this series should be scaled, should the need arise: --><attribute name="MeanFlag" type="boolean" use="default"value="false"/><attribute name="MinFlag" type="boolean" use="default"value="false"/><attribute name="MaxFlag" type="boolean" use="default"value="false"/><attribute name="RandomFlag" type="boolean" use="default"value="false"/><attribute name="FirstFlag" type="boolean" use="default"value="false"/><attribute name="LastFlag" type="boolean" use="default"value="false"/><attribute name="VarianceFlag" type="boolean" use="default"value="false"/><attribute name="VarianceScalewiseFlag" type="boolean" use="default"value="false"/></complexType><!-- *************************************************************** --><!-- Definition of abstract "SeriesOfVectorType" --><!-- *************************************************************** --><complexType name="SeriesOfVectorType" abstract="true"><attribute name="nElements" type="nonNegativeInteger" use="default"value="1"/><attribute name="VectorSize" type="positiveInteger" use="default"value="1"/></complexType><!-- *************************************************************** --><!-- Spectrum --><!-- *************************************************************** --><complexType name="AudioSpectrumEnvelope" base="AudioSampledType"derivedBy="extension"><element name="Value" type="mds:SeriesOfVectorType" /><!-- other information needed for spectra --></complexType><!-- *************************************************************** --><!-- Definition of abstract "SOVScaledType" --><!-- *************************************************************** --><complexType name="SOVScaledType" base="mds:SeriesOfVectorType"derivedBy="extension" abstract="true"><element name="Min" type="mds:FloatMatrixType" minOccurs="0"/><element name="Max" type="mds:FloatMatrixType" minOccurs="0"/><element name="Mean" type="mds:FloatMatrixType" minOccurs="0"/><element name="Random" type="mds:FloatMatrixType" minOccurs="0"/><element name="First" type="mds:FloatMatrixType" minOccurs="0"/><element name="Last" type="mds:FloatMatrixType" minOccurs="0"/><element name="Variance" type="mds:FloatMatrixType" minOccurs="0"/><element name="Covariance" type="mds:FloatMatrixType" minOccurs="0"/><element name="VarianceSummed" type="mds:FloatVectorType" minOccurs="0"/><element name="MaxSqDist" type="mds:FloatVectorType" minOccurs="0"/><element name="Weight" type="mds:FloatVectorType" minOccurs="0"/></complexType>

Page 217: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<!-- *************************************************************** -->
<!-- Definition of "SOVFixedScaleType" (uniform power-of-2 scaling)  -->
<!-- *************************************************************** -->
<complexType name="SOVFixedScaleType" base="mds:SOVScaledType" derivedBy="extension">
  <element name="VarianceScalewise" type="mds:FloatMatrixType" minOccurs="0"/>
  <attribute name="RootFirst" type="boolean" use="default" value="false"/>
  <attribute name="ScaleRatio" type="positiveInteger" use="required"/>
  <attribute name="TotalSamples" type="positiveInteger" use="required"/>
</complexType>

<!-- ******************************************************************** -->
<!-- Definition of "SOVFreeScaleType" (variable arbitrary ratio scaling)  -->
<!-- ******************************************************************** -->
<complexType name="SOVFreeScaleType" base="mds:SOVScaledType" derivedBy="extension">
  <sequence minOccurs="1" maxOccurs="unbounded">
    <element name="scaleRatio" type="positiveInteger"/>
    <element name="nElements" type="positiveInteger"/>
  </sequence>
</complexType>

<!-- *************************************************************** -->
<!-- Definition of "SOVUnscaledType"                                 -->
<!-- *************************************************************** -->
<complexType name="SOVUnscaledType" base="mds:SeriesOfVectorType" derivedBy="extension">
  <element name="Data" type="mds:FloatMatrixType" minOccurs="1"/>
  <element name="Weight" type="mds:FloatVectorType" minOccurs="0"/>
  <!-- specify how this series should be scaled, should the need arise: -->
  <attribute name="MeanFlag" type="boolean" use="default" value="false"/>
  <attribute name="MinFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxFlag" type="boolean" use="default" value="false"/>
  <attribute name="UniformFlag" type="boolean" use="default" value="false"/>
  <attribute name="RandomFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="CovarianceFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceSummedFlag" type="boolean" use="default" value="false"/>
  <attribute name="VarianceScalewiseFlag" type="boolean" use="default" value="false"/>
  <attribute name="MaxSqDistFlag" type="boolean" use="default" value="false"/>
</complexType>

<!-- ################################################ -->
<!-- Definition of the Media Profile DS               -->
<!-- ################################################ -->
<complexType name="MediaProfile">
  <element name="MediaIdentification" type="mds:MediaIdentification"/>
  <element name="MediaFormat" type="mds:MediaFormat"/>
  <element name="MediaCoding" type="mds:MediaCoding" minOccurs="0" maxOccurs="unbounded"/>
  <element name="MediaInstance" type="mds:MediaInstance" minOccurs="0" maxOccurs="unbounded"/>
  <element name="MediaTranscodingHints" type="mds:MediaTranscodingHints" minOccurs="0" maxOccurs="1"/>
  <attribute name="id" type="ID"/>
</complexType>

<attribute name="id" type="ID"/></complexType><!-- ################################################ --><!-- Definition the Media Transcoding Hint DS --><!-- ################################################ --><complexType name="MediaTranscodingHints"><element name="MotionHint" type="mds:MotionHint"minOccurs="0" maxOccurs="1"/><attribute name="DifficultyHint" type="float" use="optional"/><attribute name="Importance" type="float" use="optional"/><attribute name="id" type="ID" use="optional"/></complexType><!-- ################################################ --><!-- Definition the Motion Hint DS --><!-- ################################################ --><complexType name="MotionHint" content="empty">—<attribute name="Motion_uncompensability" type="float"/> <attribute name="MotionRangeXLeft" type="integer"/><attribute name="MotionRangeXRight" type="integer"/><attribute name="MotionRangeYBottom" type="integer"/><attribute name="MotionRangeYTop" type="integer"/></complexType><!-- ################################################ --><!-- Definition the Classification DS --><!-- ################################################ --><complexType name="Classification"><element ref="Country" minOccurs="0" maxOccurs="unbounded"/><element name="Language" type="language"minOccurs="0" maxOccurs="unbounded"/><element name="Genre" type="mds:ControlledTerm"minOccurs="0" maxOccurs="unbounded"/><element name="PackagedType" type="mds:ControlledTerm"minOccurs="0" maxOccurs="unbounded"/><element name="MediaReview" type="mds:MediaReview" minOccurs="0" maxOccurs="unbounded" /><attribute name="id" type="ID"/></complexType><!-- #############################################################--><!-- Definition of the Media Review DS as an XML Schema --><!-- #############################################################--><complexType name="Reviewer" base="mds:Person" derivedBy="extension"><element name="role" type="mds:ControlledTerm" minOccurs="0"/></complexType><complexType name="MediaReview"><element ref="Reviewer" type="mds:Reviewer"minOccurs="1" maxOccurs="1"/><element name="RatingValue" type="integer"minOccurs="1" maxOccurs="1"/><element name="RatingCriterion" minOccurs="1" maxOccurs="1"><complexType><element name="CriterionName"type="mds:TextualDescription" /><element name="WorstRating" type="integer" /><element name="BestRating" type="integer" /></complexType></element><element name="FreeTextReview" type="mds:TextualDescription"minOccurs="0" maxOccurs="unbounded"/><attribute name="id" type="ID"/></complexType><!-- ############################################### --><!-- Definition of "VideoText DS" --><!-- ############################################### -->

Page 219: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But
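<!-- Informative sketch (not part of the extracted syntax): a possible
     MediaReview instance. The reviewer content is elided; the ratings and
     review text are illustrative assumptions only.
<MediaReview id="mr1">
  <Reviewer>...</Reviewer>
  <RatingValue>4</RatingValue>
  <RatingCriterion>
    <CriterionName>Overall quality</CriterionName>
    <WorstRating>1</WorstRating>
    <BestRating>5</BestRating>
  </RatingCriterion>
  <FreeTextReview>An entertaining family feature.</FreeTextReview>
</MediaReview>
-->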

<simpleType name="TextDataType" base="string"><enumeration value="Superimposed"/><enumeration value="Scene"/></simpleType><complexType name="VideoText" base="mds:MovingRegion" derivedBy ="extension"><element name="Text" type="mds:TextualDescription"minOccurs="0" maxOccurs="1"/><attribute name="TextType" type="mds:TextDataType" use="optional"/><attribute name="FontSize" type="positiveInteger" use="optional"/><attribute name="FontType" type="string" use="optional"/></complexType><!-- ############################################### --><!-- Definition of "Matching Hint DS" --><!-- ############################################### --><!-- Datatype definition of decimal within 0.0 and 1.0 --><simpleType name="ZeroToOneDecimalDataType" base="decimal"><minInclusive value="0.0"/><maxInclusive value="1.0"/></simpleType><!-- MatchingHintValue definition --><complexType name="MatchingHintValue" base="mds:ZeroToOneDecimalDataType"derivedBy="extension"><attribute name="idref" type="IDREF" use="optional"/><attribute name="DescriptorName" type="string" use="optional"/></complexType><!-- MatchingHint DS definition --><complexType name="MatchingHint"><element name="MatchingHintValue" type="mds:MatchingHintValue"minOccurs="1" maxOccurs="unbounded"/><attribute name="id" type="ID" use="optional"/><attribute name="Reliability" type="mds:ZeroToOneDecimalDataType"use="default" value="1.0"/></complexType><!-- ############################################## --><!-- Definition of "PointOfView DS" --><!-- ############################################## --><!-- Primitive Importance type definition --><complexType name="PrimitiveImportanceType"base="mds:ZeroToOneDecimalDataType" derivedBy="extension"><attribute name="idref" type="IDREF" use="optional"/></complexType><!-- Definitions of PointOfView type --><complexType name="PointOfView"><!-- To accommodate supplementary information --><element name="Info" type="mds:StructuredAnnotation"minOccurs="0" maxOccurs="1"/><element name="Value" type="mds:PrimitiveImportanceType"minOccurs="1" maxOccurs="unbounded"/><attribute name="id" type="ID" use="optional"/><!-- To accommodate specific semantic viewpoint --><attribute name="ViewPoint" type="string" use="required"/></complexType><!-- ############################################## --><!-- Definition of "Affective DS" --><!-- ############################################## --><!-- Normalized type definition --><simpleType name="NormalizedType" base="string"><enumeration value="None"/><enumeration value="ByPeak"/><enumeration value="ByTotal"/></simpleType><!-- Affect type definition --><simpleType name="AffectType" base="string"><enumeration value="Interested"/>

Page 220: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But
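<!-- Informative sketch (not part of the extracted syntax): a possible
     PointOfView instance ranking two segments from the viewpoint "TeamA".
     The idref targets and importance scores are illustrative assumptions.
<PointOfView id="pov1" ViewPoint="TeamA">
  <Value idref="segment1">0.9</Value>
  <Value idref="segment2">0.2</Value>
</PointOfView>
-->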

<enumeration value="Excited"/><enumeration value="Bored"/><enumeration value="Surprised"/><enumeration value="Sad"/><enumeration value="Hateful"/><enumeration value="Angry"/><enumeration value="Expectant"/><enumeration value="Happy"/><enumeration value="Scared"/><enumeration value="StoryComplication"/><enumeration value="StoryShape"/></simpleType><!-- Declarations of Affective element(AffectiveDS) --><element name="Affective"><complexType><element name="Score" minOccurs="1" maxOccurs="unbounded"><complexType base="mds:ZeroToOneDecimalDataType"derivedBy="extension"><attribute name="idref" type="IDREF"use="optional"/></complexType></element><element name="Info" type="mds:StructuredAnnotation"minOccurs="0" maxOccurs="1"/><attribute name="id" type="ID" use="optional"/><attribute name="Confidence"type="mds:ZeroToOneDecimalDataType"use="default" value="1.0"/><!-- To specify affect type --><attribute name="AffectValue" type="mds:AffectType"use="required"/><attribute name="Normalized" type="mds:NormalizedType"use="default" value="None"/></complexType></element><!-- ################################################ --><!-- Definition of SoundProperty DS --><!-- ################################################ --><complexType name="SoundProperty"><element name="Title" type="mds:TextualDescription" minOccurs="0"/><element name="RefLocator" type="mds:MediaLocator" minOccurs="0"/><element name="RefTime" type="mds:MediaTime" minOccurs="0"/><element name="SyncTime" type="mds:MediaTime" minOccurs="0"/><element name="SoundLocator" type="mds:SoundLocator" minOccurs="0"/></complexType><!-- ################################################ --><!-- Definition of TextProperty DS --><!-- ################################################ --><complexType name="TextProperty"><choice><element name="RefLocator" type="mds:MediaLocator" minOccurs="0"/><element name="RefTime" type="mds:MediaTime" minOccurs="0"/></choice><element name="SyncTime" type="mds:MediaTime" minOccurs="0"/><element name="FreeText" type="mds:TextualDescription"minOccurs="0"/></complexType><!-- ################################################## --><!-- Definition of FilteringAndSearchPreferences DS --><!-- ################################################## --><element name="FilteringAndSearchPreferences"type="mds:FilteringAndSearchPreferences"/><complexType name="FilteringAndSearchPreferences"><element name="ClassificationPreferences"

Page 221: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

type="mds:ClassificationPreferences"minOccurs="0" maxOccurs="unbounded"/><element name="CreationPreferences"type="mds:CreationPreferences"minOccurs="0" maxOccurs="unbounded"/><element name="KeywordPreferences"minOccurs="0" maxOccurs="unbounded"><complexType base="mds:TextualDescription"derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><element name="SourcePreferences" type="mds:SourcePreferences"minOccurs="0" maxOccurs="unbounded"/><element name="FilteringAndSearchPreferenceType"type="mds:PreferenceType"minOccurs="0" maxOccurs="unbounded"/><element name="FilteringAndSearchPreferences"type="mds:FilteringAndSearchPreferences"minOccurs="0" maxOccurs="unbounded"/><attribute name="protection" type="boolean" use="default"value="true"/><attribute name="preferenceValue" type="integer"use="optional"/></complexType><!-- ################################################ --><!-- Definition of ClassificationPreferences DS --><!-- ################################################ --><complexType name="ClassificationPreferences"><element name="Country" minOccurs="0" maxOccurs="unbounded"><complexType base="mds:Country" derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><element name="Language" minOccurs="0" maxOccurs="unbounded"><complexType base="mds:Language" derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><element name="Genre" minOccurs="0" maxOccurs="unbounded"><complexType base="mds:ControlledTerm" derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><element name="PackagedType" minOccurs="0" maxOccurs="unbounded"><complexType base="mds:ControlledTerm" derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><element name="MediaReview" minOccurs="0" maxOccurs="unbounded"><complexType base="mds:MediaReview" derivedBy="extension"><attribute name="preferenceValue" type="integer"use="optional"/></complexType></element><attribute name="id" type="ID"/><attribute name="preferenceValue" type="integer" use="optional"/></complexType><!-- ################################################ -->

Page 222: INTERNATIONAL ORGANIZATION FOR …read.pudn.com/downloads75/doc/279385/15938/Multimedia... · Web viewThe TimeUnit is specified as 1N which is 1/30 of a second according to 30F. But

<!-- ################################################ -->
<!-- Definition of the CreationPreferences DS         -->
<!-- ################################################ -->
<complexType name="CreationPreferences">
  <element name="Title" minOccurs="0" maxOccurs="unbounded">
    <complexType base="mds:Title" derivedBy="extension">
      <attribute name="preferenceValue" type="integer" use="optional"/>
    </complexType>
  </element>
  <element name="Creator" minOccurs="0" maxOccurs="unbounded">
    <complexType base="mds:Creator" derivedBy="extension">
      <attribute name="preferenceValue" type="integer" use="optional"/>
    </complexType>
  </element>
  <element name="Location" minOccurs="0" maxOccurs="unbounded">
    <complexType base="mds:Place" derivedBy="extension">
      <attribute name="preferenceValue" type="integer" use="optional"/>
    </complexType>
  </element>
  <element name="DatePeriod" minOccurs="0" maxOccurs="unbounded">
    <complexType base="mds:Time" derivedBy="extension">
      <attribute name="preferenceValue" type="integer" use="optional"/>
    </complexType>
  </element>
  <attribute name="id" type="ID"/>
  <attribute name="preferenceValue" type="integer" use="optional"/>
</complexType>