
952 J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September

John Grant, conference chair

Metadata for Audio

AES 25th INTERNATIONAL CONFERENCE

London, UK, June 17–19, 2004


Metadata is the key to managing audio information and making sure that audiences know what is available in the way of audio program material.

This was the theme for the AES 25th International Conference held from June 17–19 at Church House in London, drawing an audience of over 100 delegates from broadcasting, universities, digital libraries, and audio program-management operations from around the world. Chaired by John Grant, with papers cochairs Russell Mason and Gerhard Stoll, the conference spanned topics such as feature extraction, frameworks, broadcast implementations, libraries and archives, and delivery.

I’M AN AUDIO ENGINEER, TEACH ME ABOUT METADATA!

Since metadata is not audio at all, but rather structures for information about audio, it lies squarely in the domain of information professionals. For this reason audio engineers can find themselves at something of a loss when it comes to understanding what is going on in metadata systems and protocols. The metadata world is full of confusing acronyms like SMEF, P-META, MPEG-7, XML, AAF, MXF, and so forth, representing a minefield for the unwary. Addressing the need for education in this important area, the first day of the conference was designed as a tutorial program, containing papers from key presenters on some of the basic principles of metadata.

John Grant, conference chair, welcomed everyone and then introduced the BBC’s Chris Chambers, who discussed identities and handling strategies for metadata, helping the conference delegates to understand the ways in which systems can find and associate the elements of a single program item. He outlined some of the standards being introduced by the industry to deal with this issue, neatly setting the scene for the following speakers who would describe the various approaches in greater detail.

Philip de Nier from BBC R&D proceeded to describe the Advanced Authoring Format (AAF) and Material Exchange Format (MXF), which are both file formats for the storage and exchange of audio–visual material together with various forms of metadata describing content and composition. This was followed by an XML primer given by Adam Lindsay from Lancaster University, UK. XML is a very widely used means of encoding metadata within so-called schemas. Continuing the theme of file formats, John Emmett proceeded to discuss ways of “keeping it simple,” describing the broadcast wave format (BWF) and AES31, an increasingly important standard for the exchange of audio content and edit lists that uses BWF as an audio file structure.

Russell Mason chaired Practical Schemes, the second session. Philippa Morrell, industry standards manager of the BOSS Federation in the UK, took delegates through the relatively obscure topic of registries. These are essentially

Photos courtesy Paul Troughton (http://photos.troughton.org)


databases that are administered centrally so as to avoid the risk of duplication and consequent misidentification of content. The issue of classification and taxonomy of sound effects is also something that requires wide agreement, and the management of this was outlined by Pedro Cano and his colleagues from the Music Technology Group at the University of Pompeu Fabra in Spain. Cano spoke about a classification scheme inspired by the MPEG-7 standard and based on an existing lexical network called WordNet, in which an extended semantic network includes the semantic, perceptual, and sound-effects-specific terms in an unambiguous way.

Richard Wright of BBC Information and Archives was keen to point out that the term metadata is not used in the same way by audio and video engineers as it is in standard definitions. Metadata, correctly defined, is the structure of information or, to refine the definition more strictly, the organization, naming, and relationships of the descriptive elements. In other words, it is everything but the data itself; it is beyond data. Describing so-called core metadata, he showed that this could be either the lowest common denominator or a complete description that fits the most general case. The first has limitations in what it can do and the second does not work; however, the first lends itself to standardization because it fits the requirements of being essential, general, simple, and popular.

Ending the tutorial day, Geoffroy Peeters from IRCAM in France chaired a workshop on MPEG-7, a key international standard involving metadata. There were four presentations on different aspects of the standard, including the first on managing large sound databases, given by Max Jacob of IRCAM, Paris, and a second on integrating low-level metadata in multimedia database management systems, given by Michael Casey of City University, London. Emilia Gómez, Oscar Celma, and colleagues from the University of Pompeu Fabra described tools for content-based retrieval and transformation of audio using MPEG-7, concentrating on the so-called SPOffline and MDTools. SPOffline stands for Sound Palette Offline and is an application based on MPEG-7 for editing, processing, and mixing, intended for sound designers and professional musicians. MDTools was designed to help content providers in the creation and production phases for metadata and annotation of multimedia content. Finishing this session, Jürgen Herre of the Fraunhofer Institute in Germany gave a tour of the issues surrounding the use of MPEG-7 low-level scalability.

METADATA FRAMEWORKS AND TOOLKITS

The first main conference day following the tutorial sessions began with a discussion of frameworks for metadata. First, Andreas Ebner of IRT introduced data models for audio and video production, and then Wes Curtis and Alison McCarthy of BBC TV outlined the EBU’s P/Meta scheme for the exchange of program data between businesses.

The Digital Media Project is led by Leonardo Chiariglione, mastermind of the MPEG standardization forums, and this exciting project was reviewed at the conference by Richard Nicol from Cambridge-MIT Institute. He spoke about the need to preserve end-to-end management of digital content


INVITED AUTHORS

Clockwise, from top left: Max Jacob, Jürgen Herre, Emilia Gómez, Michael Casey, Wes Curtis, Oscar Celma, Richard Wright, and Alison McCarthy.


in a world where such content is potentially widely distributed and easy access is needed to the information about it. Following this, Kate Grant of Nine Tiles provided delegates with an interesting summary of the “what and why” of the MPEG-21 standard, which is essentially a digital rights management structure. “The 21st Century is a new Information Age, and the ability to create, access, and use multimedia content anywhere at any time will be the fulfillment of the MPEG-21 vision,” she proclaimed.

Turning to the issue of spatial scene description, Guillaume Potard of the University of Wollongong in Australia introduced an XML-based 3-D audio-scene metadata scheme. He explained that existing 3-D audio standards such as VRML (X3D) have very basic sound-description capabilities, but that MPEG-4 has powerful 3-D audio features. However, the problem is that these are based on a so-called “scene graph” with connected nodes and objects, which can get very complex even for simple scenes. Such structures are not optimal for metadata and cannot easily be searched. Consequently he bases the proposed XML-based scheme on one from Csound, using “orchestra” and “score” concepts, whereby the orchestra describes the content or objects and the score describes the temporal behavior of the objects.
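The orchestra/score split described above can be pictured with a small sketch. The element and attribute names below are illustrative assumptions, not the actual schema Potard presented; the point is only that sources (orchestra) and their timed behavior (score) live in separate, easily searchable branches.

```python
import xml.etree.ElementTree as ET

def build_scene():
    """Build a toy orchestra/score 3-D audio scene description."""
    scene = ET.Element("AudioScene")

    # "Orchestra": the content -- the sound objects and their properties.
    orchestra = ET.SubElement(scene, "Orchestra")
    src = ET.SubElement(orchestra, "Source", id="bird1", file="bird.wav")
    ET.SubElement(src, "Directivity", pattern="omni")

    # "Score": the temporal behavior of those objects.
    score = ET.SubElement(scene, "Score")
    ev = ET.SubElement(score, "Event", source="bird1", start="0.0", end="4.0")
    ET.SubElement(ev, "Position", x="2.0", y="0.0", z="1.5")

    return ET.tostring(scene, encoding="unicode")

print(build_scene())
```

Because each branch is flat and named, a query such as "all sources audible between 0 and 4 seconds" becomes a simple search over `Event` elements, which is exactly what a scene graph of nested nodes makes difficult.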

POSTERS

A substantial number of poster papers were presented during the 25th International Conference, almost all concerned with the topic of feature extraction and automatic content recognition. Many of the posters described methods by which different musical features can be analyzed and described, such as instrument recognition, song complexity, percussion information, and rhythmic meter.

Some unusual presentations in this series included a means of retrieving spoken documents in accordance with the MPEG-7 standard, given by Nicolas Moreau and colleagues from the Technical University of Berlin, and an opera information system based on MPEG-7, presented by Oscar Celma from Pompeu Fabra.

FEATURE EXTRACTION

The topic of feature extraction is closely related to that of metadata. It also has a lot in common with the field of semantic audio analysis: the automatic extraction of meaning from audio. Essentially, feature extraction is concerned with various methods for identifying important aspects of audio signals, whether in the spectral, temporal, or spatial domain, searching for objects and their characteristics and describing them in some way.

A variety of work in this field was described in papers presented during the afternoon of the first conference day, many of which related to the analysis of music signals. For example, Claas Derboven of Fraunhofer presented a paper on a system for harmonic analysis of polyphonic music, based on a statistical approach that examines the frequency of occurrence of musical notes for determining the musical key. Chords are determined by a pattern-matching algorithm, and tonal components are determined by using a nonuniform frequency transform.
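The general idea of determining a key from note-occurrence statistics can be sketched in a few lines. The sketch below uses the classic Krumhansl-Schmuckler profile-correlation method as a stand-in; it is not necessarily the algorithm in Derboven's paper, merely an illustration of matching a pitch-class histogram against key templates.

```python
import numpy as np

# Krumhansl's perceptual weighting of the 12 pitch classes in a major key,
# with index 0 as the tonic.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]

def estimate_major_key(pitch_class_counts):
    """Correlate a 12-bin note-occurrence histogram against every rotation
    of the major-key profile; the best-scoring rotation names the key."""
    counts = np.asarray(pitch_class_counts, dtype=float)
    scores = [np.corrcoef(counts, np.roll(MAJOR_PROFILE, k))[0, 1]
              for k in range(12)]
    return NOTE_NAMES[int(np.argmax(scores))] + " major"

# Toy histogram dominated by the notes of the C major scale.
hist = [10, 0, 6, 0, 8, 5, 0, 9, 0, 4, 0, 3]
print(estimate_major_key(hist))  # -> "C major"
```

A real system would of course first extract the note occurrences from polyphonic audio, which is the hard part the paper addresses; minor-key profiles would be handled the same way with a second template.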


INVITED AUTHORS

Clockwise, from top left: Philippa Morrell, Philip de Nier, Richard Nicol, Kate Grant, Chris Chambers, Andreas Ebner, John Emmett, and Adam Lindsay.


Bee-Suan Ong was interested to find out how representative sections of music can be identified automatically. She used the example of the Beatles’ “Hard Day’s Night” to show how humans can quickly name the title after hearing very short but easily recognizable bits of the song, whereas machines have to be shown how to do this. Musical items have to be segmented so as to separate the parts of a song; this can be done either using a model involving a training phase or using an approach that is training-model free.

Christian Uhle of Fraunhofer looked at the matter of drum-pattern-based genre classification of popular music. The method transcribes unpitched percussive instruments using independent subspace analysis, extracting an amplitude envelope from each spectral profile; note onsets corresponding to each percussive instrument are then detected from these envelopes. Continuing the theme of musical rhythm, a paper from Fabien Gouyon and his colleagues dealt with the evaluation of rhythmic descriptors for musical genre classification. This involved an attempt to detect the rhythmic pattern and tempo from an audio signal, followed by training and matching using a database of typical ballroom dancing numbers with specific genres such as jive, quickstep, and tango. In the final paper on feature extraction, Oliver Hellmuth and colleagues from Fraunhofer described a process of music-genre estimation from low-level audio features, in this case attempting to detect the genre from nine broad possibilities including blues, classical, country, and so forth, using MPEG-7 low-level features such as spectral flatness, spectral crest factor, sharpness, and audio spectrum envelope. They found that the correct genre was assigned for 91.4% of all items.
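To make one of these low-level features concrete, here is spectral flatness in its textbook form: the ratio of the geometric to the arithmetic mean of the power spectrum. Noise-like signals score high, tonal signals score near zero. Note this is the general definition, not the exact MPEG-7 AudioSpectrumFlatness descriptor, which is computed per frequency band.

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum.
    Noise-like spectra approach 1; a single peak drives it toward 0."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12  # floor avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    return geometric_mean / np.mean(power)

rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)                       # broadband noise
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 44100)  # 440-Hz sine

print(f"noise flatness: {spectral_flatness(noise):.2f}")  # well above the tone's
print(f"tone flatness:  {spectral_flatness(tone):.2e}")   # near zero
```

A genre classifier of the kind Hellmuth described would compute such descriptors frame by frame and feed the resulting feature vectors to a trained statistical model.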

BROADCASTING IMPLEMENTATIONS

One of the primary application areas for metadata structures is in broadcasting, where multiple users need to be able to store and access program material in a way that can enhance the production and distribution process. Broadcasters have typically been among the first to develop and embrace the various standards for audio metadata. The second full conference day involved a morning devoted to this important topic.

Shigeru Aoki, from Tokyo FM Broadcasting, presented a paper on audio metadata in broadcasting that advocated a file format containing the audio and metadata within one file.


Excellent lecture facilities were provided by Church House; located in the heart of London, it also hosts the annual meeting of the Church of England.

Q&A sessions after each presentation at the conference afforded attendees the opportunity to get further clarification from authors and to offer their own perspective on the topic: Claudia Schremmer and Mark Plumbley are shown here.



He concluded that the Broadcast WAVE file is suitable for this purpose and that cue-sheet metadata, for example, can be stored in an <adtl> chunk so that it remains with the audio.
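The mechanism above is plain RIFF chunk layout: a cue point in a "cue " chunk, and a matching text label in a "labl" sub-chunk of an associated-data-list ("adtl") LIST chunk. The minimal sketch below writes such a file by hand; a real broadcast file would also carry a "bext" chunk and far richer metadata, and the audio here is just silence.

```python
import struct

def labl_subchunk(cue_id, text):
    """Build a "labl" sub-chunk tying a text label to a cue point."""
    data = text.encode("ascii") + b"\x00"
    if len(data) % 2:                       # RIFF chunks are word-aligned
        data += b"\x00"
    return (b"labl" + struct.pack("<I", 4 + len(data))
            + struct.pack("<I", cue_id) + data)

def write_wave_with_label(path, samples, label):
    """Write a mono 16-bit WAVE file with one cue point and one label."""
    fmt = b"fmt " + struct.pack("<IHHIIHH", 16, 1, 1, 44100, 88200, 2, 16)
    data = (b"data" + struct.pack("<I", len(samples) * 2)
            + struct.pack(f"<{len(samples)}h", *samples))
    # One cue point (id=1) at sample offset 0 of the data chunk.
    cue = (b"cue " + struct.pack("<II", 4 + 24, 1)
           + struct.pack("<II4sIII", 1, 0, b"data", 0, 0, 0))
    adtl_body = b"adtl" + labl_subchunk(1, label)
    lst = b"LIST" + struct.pack("<I", len(adtl_body)) + adtl_body
    body = b"WAVE" + fmt + data + cue + lst
    with open(path, "wb") as f:
        f.write(b"RIFF" + struct.pack("<I", len(body)) + body)

write_wave_with_label("labeled.wav", [0] * 100, "Take 1 intro")
```

Because the label travels inside the same file as the samples, it survives any transfer that preserves the WAVE container, which is exactly the property Aoki argued for.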

Joe Bull of SADiE and Kai-Uwe Kaup of VCS spoke about integrated multimedia in the broadcast environment, looking at concepts of workflow, media assets, and genealogy. Genealogy, for example, is concerned with the tracking of the processes performed on the media assets and the different version numbers, so that the “lineage” of a media item can be traced. This is particularly important if royalties are to be allocated appropriately. These two company representatives were keen to emphasize the importance of developing an integrated approach to metadata so that material can be transferred between systems, but they felt that it is unlikely that the whole process can be automated, and they can also see that each broadcaster will end up with an approach specifically tailored to its operation.

The second session on broadcast implementations concerned interchange file formats. Broadcast WAVE is increasingly used as a universal file format for audio in broadcast environments. Phil Tudor, speaking on behalf of authors from Snell and Wilcox, showed how this can be encapsulated within the Material Exchange Format (MXF), a SMPTE-standardized interchange format. A move to enable this encapsulation is currently under way, which, if successfully standardized, will enable both AES3 digital audio data and Broadcast WAVE contents to be transferred in MXF and reconstituted in their original form if required. David McLeish of SADiE then described ways in which editing information can be exchanged within the Advanced Authoring Format. He concluded that AAF provides the flexibility and extensibility to enable the necessary compositional decisions and effects parameters to be exchanged between users or systems. AAF and MXF fulfill different purposes within the broadcast production chain, but are intended to be complementary. A so-called zero-divergence doctrine exists between them to avoid incompatibility.

LIBRARIES AND ARCHIVES

Three papers on the afternoon of the second day concerned the use of audio metadata in libraries and archives. Sam Brylawski of the U.S. Library of Congress spoke about the development of a digital preservation program, and in particular the National Sound Conservation Archive. He said that they were considering using a 96-kHz, 24-bit PCM format for archive masters, one-bit formats being “ahead of the game” for them at the present time. The Library of Congress archive will be based on OAIS (Open Archival Information System, see http://www.rlg.org/longterm/oais.html), in which media objects will be described using MODS (the Metadata Object Description Schema, see http://www.loc.gov/standards/mods/) and metadata encoded using METS (Metadata Encoding and Transmission Standard, see http://www.loc.gov/standards/mets/). Long-term content management will be enhanced by additional metadata, but the danger is that, with so much information needing to be collected, the process will have to be automated in some way.

Miguel Rodeño, of the Computer Science Department at the Universidad de Alcalá in Spain and also associated with IBM Storage and Sales in Madrid, discussed the audio metadata used in the Radio Nacional de España Sound Archive


Left, the balcony at Church House was a favorite spot for casual conversation during coffee breaks and lunches. Above, Mike Story (left) and John Richards, with the southern tower of the Houses of Parliament as a backdrop, share a toast on the balcony.

Poster presentations at the 25th Conference were well received.




Before the banquet (above) at the Houses of Parliament, the AES guests, including Francis and Sally Rumsey and Roger Furness (left), enjoyed a glass of wine on the terrace overlooking the River Thames.

AES 25th Conference Committee: clockwise from top left, John Grant, Chris Chambers, Russell Mason, Gerhard Stoll, Paul Troughton, Mark Yonge, and Heather Lane.

Project. This was a large enterprise designed to store 20th Century Spanish sound history in digital form. They used the Broadcast WAVE format but managed the use of the broadcast extension chunk of the file in a clever way to store description metadata, retaining information about the original physical media from which the file had been derived.

Dublin Core (DC) is a widely used metadata scheme that has been implemented in a particular way by the Scandinavian Audiovisual Metadata (SAM, see http://www.nrk.no/informasjon/organisasjonen/iasa/metadata/) group for general use within the audio industry. Gunnar Dahl of NRK in Norway described the work undertaken by these 25 archive specialists. The work was subsequently approved by the EBU in the P/FRA group (Future Radio Archives). In a format they called AXML, they extended the basic fifteen DC elements, represented using an XML syntax, with new subsets that have proven to cover all the needs of the broadcast production chain. An AXML chunk can be included within a Broadcast WAVE file.
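A sketch of what Dublin Core description in XML looks like may help here. The `dc:` elements below are standard Dublin Core; the `axml` wrapper element and the example values are illustrative assumptions, since the actual AXML extension elements are not given in the article.

```python
import xml.etree.ElementTree as ET

# Register the standard Dublin Core namespace so it serializes as "dc:".
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
DC = "{http://purl.org/dc/elements/1.1/}"

root = ET.Element("axml")  # hypothetical wrapper, as the chunk is named
for tag, value in [("title", "Morning News 2004-06-17"),
                   ("creator", "NRK Radio"),
                   ("date", "2004-06-17"),
                   ("format", "audio/wav"),
                   ("language", "no")]:
    ET.SubElement(root, DC + tag).text = value

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Serialized this way, the description is a self-contained text block that can be dropped into a Broadcast WAVE chunk, which is precisely how the AXML scheme keeps DC metadata traveling with the audio.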

DELIVERY OF AUDIO

Tim Jackson, from the Department of Computing and Mathematics at Manchester Metropolitan University, introduced the audience to watermarking and copy protection by information hiding in soundtracks. He suggested that while traditional watermarking schemes can be used for copy protection, information-hiding schemes might possibly be useful for copy prevention in the analog domain, perhaps by interfering with subsequent audio coding devices such as sigma-delta modulators or subband coders.

Music is increasingly being distributed online, resulting in changing business models. Metadata enables business models to be developed that conform to specific rules for digital

J. Audio Eng. Soc., Vol. 52, No. 9, 2004 September 961

rights management. Nicolas Sincaglia of MusicNow covered a number of issues related to this topic, concluding that the technical side of the industry will continue to be challenged to build and operate the complex business systems needed to support the promotions and creative sales side of the industry.

Audio metadata transcription from meeting transcripts for the Continuous Media Web (CMWeb) was described by Claudia Schremmer from the CSIRO ICT Centre in Australia. The problem she described was that continuous media such as audio and video are not well suited to the current searching and browsing techniques used on the Web. A format known as Annodex can be used to stream the media content, multiplexed with XML markup in the Continuous Media Markup Language (CMML). The project she described dealt with the issue of automatically generating Annodex streams from complex annotated recordings collected for use in linguistics research.

DINNER AT THE HOUSES OF PARLIAMENT

Thanks to Heather Lane’s superb organizational skills and auspicious connections, an opportunity was offered to delegates to visit the historic Houses of Parliament, just a short walk from Church House in Westminster. A guided tour enabled visitors to see the debating chambers and division lobbies, and even to play a part in acting out a typical voting procedure, which is done according to old customs.

Statues of former prime ministers such as Winston Churchill, Lloyd George, and Harold Macmillan lined the walls. The House of Lords throne glistened with 24-carat gold worth over a million pounds. The audio systems used in the debating chambers were explained; the system uses numerous microphones suspended on small flat cables from the ceiling so that they retain their angles of orientation and can be raised and lowered for events requiring a clear line of view, such as television relays of the state opening.

After the tour the visitors had the rare privilege of enjoying a glass of wine on the famous terrace of the House overlooking the River Thames, followed by an excellent dinner around a long table in one of the dining rooms of the Houses of Parliament.

THANKS TO…

The conference committee (John Grant, chair; Gerhard Stoll and Russell Mason, papers cochairs; Mark Yonge, Chris Chambers, Paul Troughton, and AES UK Secretary Heather Lane, with support from AES Executive Director Roger Furness) worked hard to make Metadata for Audio mega-enjoyable and to ensure that attendees experienced a rewarding three days in London. The conference proceedings and CD-ROM can be purchased online at http://www.aes.org/publications/conf.cfm.


The well-preserved historic buildings of London, such as Westminster Abbey as seen from the balcony of Church House, served as reminders to conference attendees that preservation of historic audio is one of the important functions of metadata.