MARSYAS3D: A PROTOTYPE AUDIO BROWSER-EDIT OR USING …soundlab.cs.princeton.edu/publications/2001_icad_m3d.pdf · movies, music browsing, computer music and in general any ap-plication

Proceedingsof the2001InternationalConferenceon AuditoryDisplay, Espoo,Finland,July 29- August1, 2001

MARSYAS3D: A PROTOTYPE AUDIO BROWSER-EDITORUSING A LARGE SCALE IMMERSIVE VISUAL AND AUDIO DISPLAY

GeorgeTzanetakis

ComputerScienceDept.PrincetonUniversity

[email protected]

Perry Cook

ComputerScienceandMusic Dept.PrincetonUniversity

NJ [email protected]

ABSTRACTMostaudioeditingtoolsoffer limitedcapabilitiesfor browsingandediting large collectionsof files. Moreover working with manyaudiofiles tendsto clutter the limited screenspaceof a desktopmonitor. In this paperwe describeMARSYAS3D,a prototypeau-dio browser andeditor for large audiocollections. A variety of2D and3D graphicsinterfacesfor workingwith collectionsand/orindividual files have beendeveloped. Many of theseinterfacesareinformedby automaticcontent-basedaudioanalysistools.Al-thoughMARSYAS3D canbe usedwith a desktopmonitor it hasbeenspecificallydesignedasanapplicationfor thePrincetonScal-ableDisplayWall project.

ThecurrentDisplayWall systemhasan8 x 18-foot rearpro-jectionscreenwith a resolutionof 4096x 1536pixels. For sounda custom-made16-speaker surroundsystemis used.Theusersin-teractwith theWall usinga varietyof inputmethods.This immer-sive display allows for visual and aural presentationof detailedinformationfor browsing andediting large audiocollectionsandsupportsnaturalinteractive collaborationsamongmultiple simul-taneoususers.

1. INTR ODUCTION

Continuousrapid-improvementsin CPUperformanceandstoragedensityhave madepossiblethecollectionof largeamountsof au-dio data. Theability to interchangefiles over the Webandaudiocompressionformatslike MP3 have further facilitatedthis trend.Mostaudioeditingtoolsoffer limitedcapabilitiesfor browsingandeditinglargecollectionsof audiofiles (typically only thefilenameandsomeform of ID taggingis used).

In this paperwe describeMARSYAS3D, a prototypeaudiobrowserandeditordesignedfor working with largecollectionsofaudiofiles. A varietyof 2D and3D graphicalinterfacesfor work-ing with collectionsand/orindividual files have beendeveloped.Many of theseinterfacesareinformedby automaticcontent-basedaudioanalysistools. Thesetools areall integratedundera com-monunderlyingframework thatprovidesa uniqueconsistentdatarepresentation.

Working with many audio files using a desktopmonitor isdifficult becauseof the large numberof necessaryinterfacesandwindows. Although MARSYAS3D can be usedwith a desktopmonitor it hasbeenspecificallydesignedasanapplicationfor thePrincetonScalableDisplayWall project.Thislargeimmersivedis-play allows for visualandauralpresentationof detailedinforma-tion for browsingandeditinglargeaudiocollectionsandsupportsinteractive collaborationsamongmultiple simultaneoususers.

Figure1: MARSYAS3Don theDisplayWall

1.1. The Wall

ThePrincetonScalableDisplayWall projectexploresbuilding andusingalarge-formatdisplaywith multiplecommoditycomponents.The prototypesystemhasbeenoperationalsinceMarch 1998. Itcomprisesan8 x 18 foot rear-projectionscreenwith a 4 x 2 arrayof projectors.Eachprojectoris drivenby acommodityPCwith anoff-the-shelfgraphicsaccelerator. Themulti-speaker displaysys-temhas16speakersaroundtheareain front of thewall andis con-trolled by a soundserverPCthatusestwo 8-channelsoundcards.A moredetaileddescriptionof theprojectcanbe found [1]. Fig-ure 1 shows MARSYAS3D on thePrincetonDisplayWall. Highresolutioncolor imagescanbefoundat

http://www.cs.princeton.edu/˜ gtzan/marsyas.html.

1.2. Applications

Thedriving applicationbehindthedevelopmentof MARSYAS3Dis researchin content-basedaudio information tools. Other po-tentialapplicationsareauditorydisplayresearch,soundeffectsformovies, musicbrowsing, computermusicandin generalany ap-plicationareawherethereis aneedto work on largecollectionsofdigital audio.

We looselydefineaslarge any collectionwith morethan100audio files. Empirically this is the approximatenumberof fileswhereworking only with filenamesandID tagsbecomestedious.Collectionscan have an arbitrary numberof files subjectto thememoryanddisk constraintsof theparticularsystem.Theclient-server andmodulararchitectureof thesystemmakesit scalabletomorethanonecomputer. In additionit facilitatesinteractionwithnon-standardinput deviceslike SpeechRecognition.

ICAD01-1


As a typical applicationscenarioimaginehaving to createthesoundscapeof walking througha forestusinga databaseof soundeffects. As a first steponemight look for the soundof walkingon soil. After clusteringthe collection you would notice that aparticularclusterhasmany files with Walking in their filename.Expandingonthisclusteryouwouldselect4 soundfileslabelledaswalking on soil. Usingspatializedaudioyou couldautomaticallyheargeneratedthumbnailsof the4 soundfiles simultaneouslyandmakethefinal decision.In asimilar fashionyoucouldlocatesomefiles with bird callsanda file with thesoundof a waterfall. Oncethe componentsof the soundscapeareselectedyou can edit theselectedsoundfilesandperformthe necessarymixing operationsto createthe final soundscapeof walking througha forest. Theillusion of approachingthewaterfall canbeachievedby artificialreverberationandcrossfading.

All thesetasksareperformedin front of a large screenwithmultiple graphicsdisplaysthatareall consistentandconnectedtothesameunderlyingdatarepresentationmodule.Of coursehavinga large collaborative spaceallows morethanoneuserto interactsimultaneouslywith thesystem.

2. RELATED WORK

A significantinfluencefor this projecthasbeenthemusicbrows-ing work of [2, 3]. In this work, a prototype2D graphicsbrowsercombinedwith multi-streamsonic browsing was developedandevaluated.Thedriving applicationbehindthedevelopmentof theirbrowserwasmusicologicalresearchin Irish traditionalmusical-thoughtheir browsercouldeasilybeappliedto otherapplications.In additionto offering similar browsingcapabilitiesour systemisassistedby automaticcontent-basedaudioanalysistoolsandsup-ports3D graphicsdisplays. Someof the 3D graphicstools havebeendescribedin [4]. Furthermore,oncethe desiredfiles havebeenlocated,soundeditingcapabilitiesareprovided.

Commerciallyavailableaudioeditorsareeithersmallscaleed-itors that work with relatively short soundfiles, or professionaltools for audiorecordingandproduction.In bothcasesthey offervery little supportfor browsing large collections. Typically theysupportmontage-like operationslikecopying, mixing, cuttingandsplicingandsomesignalprocessingtoolssuchasfiltering, rever-berationandflanging.In additionto thesestandardfunctionalities,oureditorofferscontent-basedaudioanalysistoolslikesegmenta-tion andtimbre-basedcoloringto assisttheeditingprocess.

Although therehave beenattempts[5, 6] to develop “intelli-gent” soundeditorsmost of their researcheffort was spendad-dressingproblemsrelatedto thelimited hardwareof thattime. Au-dio analysistoolshavebeenusedsuccesfullymorerecentlyin [7],wherespeechrecognitiontechniquesareemployedto index audioandvideo.

3. SYSTEM ARCHITECTURE

Thesystemarchitecturehasbeendesignedwith theShneidermanmantrafor userinterfaces”overview first, zoomand filter, thendetails on demand” in mind [8]. Moreover the usercan con-figure the systemto suit his specificneedsand it is easyfor theprogrammerto extend the systemwith new interfaces. A veryimportantcharacteristicof the systemis that althoughtherearemultiple interfacesandviewseverythingis connectedandupdatedsynchronouslyfrom thesameunderlyingdatarepresentation.

BrowserSoundSpace

AbstractMapfile

User

AbstractBrowser

Collection

Mapfile

User

Collection AbstractEditor

Item

Mapfile

User

Item

Monitor

ItemAbstract

Mapfile

User

Analysis Tool

Collection or Item Collection or Item

Collection Data Representation

BrowserTimbreSpace3DBrowser

TimbreSpace2D

Item DataRepresentation

Sound FilesSpectogramMonitor Monitor

ClassGram

WaveformEditor

MARSYAS3D ARCHITECTURE

MusicSpeech ClassifierAnalysis Tool

SegmentationAnalysis Tool

Figure2: MARSYAS3DSystemArchitecture

Centralto thedesignof thesystemis theconceptof browsing.MarchioniniandShneiderman[9] definebrowsingas:

� an exploratory, informationseekingstrategy that dependsuponserendipity

� especiallyappropriatefor ill-defined problemsandfor ex-ploring new taskdomains

Workingwith largeaudiocollectionshasthosecharacteristics,there-fore a variety of differentBrowsers areprovided. The Browsersrenderthecollectionbasedon parametersstoredin MapFiles. Inmany casestheMapFile canbeautomaticallygeneratedby auto-matic Audio AnalysisTools. Using the Browserthe usercanex-plorethespaceandselectasubsetof thecollectionfor subsequentbrowsing or editing. Editing theselectedfiles canbe doneusingtheEditor which is alsoconfiguredby a MapFiles, possiblyauto-maticallygenerated.Duringplaybackof thefilesvariousMonitorscanbesetupto visual informationthat is updatedin real-time.Aschematicof thesystem’s architecturecanbeseenin Figure 2.

3.1. Terminology

Thefollowing termswill beusedfor themoredetaileddescriptionof thesystem’s components:

Collection is a list of audiofiles. Thereis a uniqueunderlyingdatarepresentationfor collectionsthat is sharedamongalltheBrowsers. Any subsetof acollectionisalsoacollection.

Browsers presentthecollectionfor browsingusingdifferentau-dio andvisualdisplays.Theusercanexplorethespaceandselectspecificitemsfor subsequentbrowsingor editing.

ICAD01-2


MapFiles maptheindividual itemsof thecollectionto theparam-etersrequiredby a specificBrowser, Editor or Monitor.

MapFileBuilder : is a tool for creatingMapFiles eitherautomat-ically usingAudioAnalysisToolsor manually.

Analysis Tools : apply signal-processingfeatureextraction andpatternrecognitiontechniquesto automaticallyinfer infor-mationaboutthecontentof collectionsandindividualitems.In somecasesapplyingananalysistool to a collectioncor-respondsto batchprocessingof all theindividual files.

RealTime monitors : are2D or 3D graphicdisplaysthatareup-datedin real-timewhile thesoundis playing.

Item Editor : is theonly componentthataffectstheactualcon-tentof thefiles. It offersautomaticandsemi-automaticau-dio analysistoolsin additionto standardeditinganddigitalaudioeffects.

In Figure 1 the following interfacesare shown on the dis-play wall: threeEditors, a ClassGramMonitor, a TimbreGramBrowser, anda TimbreSpace3DBrowser.

4. AUDIO ANALYSIS TOOLS

Audio analysistools apply signal processingand patternrecog-nition techniquesto extract informationaboutthe contentof au-dio files. Examplesof audioanalysistoolsareclassification,seg-mentation,similarity-retrieval, thumbnailing,principalcomponentanalysis(PCA),beat-detectionandclustering.

Examplesof classifierssupportedby thesystemare:MusicvsSpeech,Male vs Femalevs Sports,Music GenresandInstrumentfamilies. They have beenstatisticallytrainedfrom representativedatasetsandevaluated[10]. Segmentationrefersto theprocessofdetectingsegmentswhenthereis a changeof “texture” in a soundstream. The chorusof a song,the entranceof a guitar solo or achangeof speaker areexamplesof segmentationboundaries.Seg-mentationcanbe usedto breaka singlesoundfileinto segmentsthat can be subsequentlyviewed as a collection using the “Ex-plode”operation.

Audio thumbnailingis the processof creatinga short sum-marysoundfile thatbestcapturestheessentialelementsof a largesoundfile. Thumbailingis very importantfor quick browsingus-ing theSoundSpacebrowser. Clusteringcanbeusedto hierarchi-cally groupfileswhereno labellinginformationis provided.Beat-detectionalgorithms[11] measurethe rhythmic beats-per-minute(bpm) for musicand. Finally PCA is a dimensionality-reductiontechnique[12], usedin oursystemto maphigh-dimensionalspec-tral featurespacesto lower-dimensionalMapFile spaces. Moredetailsaboutthesetechniquescanbefoundin [13, 10].

An importantaspectof MARSYAS3D is that all thesetech-niquesare integratedundera commonframework and are usedto supportthe Browsers, Editor, andMonitors. For examplein aTimbreSpace, PCA canbeusedto mapaudiofiles to pointsin 3Dspaceandclusteringcanbeusedto color them.Segmentationcanbeusedto quickly locatearegionfor editingandclassificationcanbeusedto inform a ClassGrammonitor. Theflexibility providedby MapFiles doesnot enforcespecifictechniquesfor eachtypeof display. For examplea TimbreSpace2Dcould easily be con-structedby manuallymoving pointsin a 2D planeaswell aswithautomatictechniques.

Figure3: TimbreSpace3D

5. BROWSERS

The key operationany browsermustsupportis “Zooming”. Byzoominga selectedsubsetof a collectionis openedasa new col-lection. This is easyto implementbecauseany subsetof a col-lection is alsoa collection(samefor thecorrespondingMapFile).Eachbrowseralsosupportsa currentlyselecteditem. It is impor-tant to notethat sinceall browserareupdatedsynchronouslyanypossiblecombinationof themcanbeused.

Thefollowing browsersareimplementedin thesystem:

Standard is just a list of fileanamescorrespondingto the audiofiles. Standardsingleandmultiple mouseselectionis sup-ported.

Tree rendersthe itemsbasedon a treerepresentation.This treerepresentationcanbegeneratedby automaticclassification.For examplethe first level of the tree could be Music vsSpeechandSpeechcouldbefurthersubdividedto Male vsFemaleandsoforth.

Timbr egramBrowser renderseachfile asTimbreGram that arerectanglesconsistingof vertical color stripeswhereeachstripe correspondsto a short segmentof sound. Time ismappedfrom left to right. They revealsoundsthataresimi-lar by colorsimilarity andtimeperiodicity. For exampletheTimbreGramsof two walkingsoundswith differentnumberof footstepswill have the sameperiodicity in color. Dif-ferentcolormappingscanbeusedandPCAcanbeappliedto reducespectralfeaturesto thecolor parametersfor eachstripe.Figure 4 showsTimbreGramcoloringsuperimposedover a waveformdisplay.

Timbr eSpace2Drepresenteachitemasageometricshapeonplanein a similar fashionto the visual browser of [2, 3]. Theplacement,shapeandcolor of eachitem arecontrolledbytheMapfile. PCAcanbeusedfor automaticplacementandclassificationor clusteringcanbeusedfor automaticcolor-ing or shape.

Timbr eSpace3Dis similar to the TimbreSpace3Din 3 dimen-sions. The usercanzoom,rotateand translatethe space.Principalcurves[14] canbeusedtomovesequentiallythroughthedata.This displayis especiallyappropriatefor thedis-play wall where3D illusion is much stronger. Figure 3shows aTimbreSpace3Dbrowser.

ICAD01-3


Figure4: MARSYAS3DEditor

SoundSpacetakesadvantageof thecocktail-partyeffect in asim-ilar fashionto the music browser of [2, 3]. As the userbrowses,a neighborhood(aura)aroundthe currently se-lected item is auralizedby playing all the correspondingfiles simultaneously. In theMapFile eachfile is mappedtooneof the16 loudspeakersof thesystemandthe intensitylevelsarenormalizedfor betterpresentation.Althoughall16 loudspeakers can be usedwe have found that at most8 simultaneousaudiostreamsareuseful. Using moread-vanced3D audio renderingtechniquesto placefiles any-wherein thespaceis plannedfor thefuture.To shortentheplaybacktime,equallengththumnailscanbeautomaticallygeneratedfor eachfile.

6. EDIT OR

Theeditorsupportsstandardoperationslikecutting,copying,past-ing, splicingetc. Thoseoperationswork in theindividual file andbetweendifferentfiles that areopenat the sametime. A limitednumberof digital audioeffectslike reverberationandfiltering aresupported.Addingnew effectsto thesystemis easy.

In addition to thesestandardeditor featuresAudio analysistoolslike classification,segmentation,thumbnailingcanassisttheediting process. Computer-assistedannotationis also provided.As anexample,a usercanquickly detectthesaxophonesolo in aJazzpieceandannotateit with a textual description.Anotherfea-tureof theeditoris “Explode” whichusesautomaticsegmentationto breaktheaudioandthencreatesacollectionof thesegmentsforbrowsing. Figure 4 show theMARSYAS3D Editor with a wave-form displaywith TimbreGramcoloring.

7. MONIT ORS

Monitorsdisplayinformationthatis updatedin real-timewhile theaudiofilesarebeingplayed.In many cases,they usetheresultsofautomaticaudioanalysistoolsthatwork in real-time.

Waveform is the traditionaldisplay of the signalusedby com-mercialeditors.

Spectrum is a traditionalspectrumdisplayshowing the magni-tudein Decibelsof theShortTimeFourrierTransform(STFT)of thesignal.

Figure5: ClassGramBrowser(music-speech genres

Spectogram rendersthe sameinformationasthe Spectrumasagreyscaleimage. Time is mappedfrom left to right andthe frequency amplitudesare mappedto greyscalevaluesaccordingto their magnitude.

Waterfall showsawaterfall-like3D displayof acascadeof Spec-trum plotsin 3D space.

Timbr eBall shows a spheremoving a 3D cube.Eachaxisof thecubecanbemappedto aspecificaudioanalyssfeature.Forexample,loudness,spectralcentroid,andspectralflux (fea-turesknown to beperceptuallyimportant)couldbemappedto thex,y,z axis.A shadow is providedfor betterdepthper-ception.

Amplitude shows anamplitudemeterfor thesignal.

Metr onome shows the bpm of the signalanda confidencelevelfor how strongthe beatis. The beatis detectedautomati-cally usingtechniquessimilar to [11].

FeaturePlot shows a plot of a user-selectedfeature.

ClassGram showstherelativeweightsof classificationconfidence.For eachclassa confidencemeasure,rangingfrom 0.0 to1.0 is calculatedandusedto move up or down cylinderscorrespondingto eachclass.Thecylinderscanbe texture-mappedwith characteristicimagesfor eachclass.This dis-play canreveal detailedinformationaboutthe underlyingcontent.For examplein a rapsongboththe“Male Speech”cylinderandthe“Rap” cylinder wouldgo up.

8. MAPFILES

MapFiles control theappearanceandbehaviors of thevariousin-terfacesin MARSYAS3Dby mappingthefilenamesto specificpa-rameters.Forexamplein aTimbreSpace3Dthefilenameswouldbemappedto 3D position,shapeandcolor. Theseparameterscouldbe automaticallygeneratedby PCA, classificationandclusteringor couldbecreatedmanually. In a SoundSpaceeachfile would bemappedto aspecificloudspeaker andintensity. MapFileBuilder isthe applicationfor creatingMapFiles for collections. Currentlyitis a separateapplicationwithouta graphicaluserinterface.

ICAD01-4


9. COLLECTIONS

A varietyof largeaudiocollectionswereusedfor developingandtestingMARSYAS3D.Thelengthof theaudiofilesvariesfrom 30seconds(mostof the collections)to 3-4 minutes. The followingcollectionswhereutilized:

MusicSpeech 200audioclips speechandmusic

Radio 100audioclips recordedfrom FM radio

Jazz 100jazzaudioclips recordedfrom CDs

Instruments 913isolatedmusicalinstrumentnotesfromtheMcGillMUMS CDs

Sfx 200 soundeffects clips recordedfrom varioussoundeffectlibraries

10. IMPLEMENT ATION

The software is implementedfollowing a client-server architec-ture. All thecomputation-intensive signalprocessingandpatternrecognitionalgorithmsrequiredfor audioanalysisareperformedusinga server written in C++. Thecodeis optimizedresultinginreal-timefeaturecalculationandmonitorupdates.Thecollectionbrowser, item editor, real-timemonitorsandthe integrationcodearewrittenin JAVA andJAVA3D. ThesoftwarerunsonLinux, So-laris, Irix andWindows (95,98,00,NT).A freesoftwareversionofsomeof thecomponentsis providedin :

http://www.cs.princeton.edu/˜ gtzan/marsyas.htmlThesamelink containscolor picturesof thefiguresandaddi-

tional informationrelatedto this paper.

11. FUTURE DIRECTIONS

Thesystemisstill aresearchprototypeunderdevelopmentsotherehasbeenno formalevaluation.A task-basedevaluationis plannedfor the future. In this evaluation,comparisonbetweenthe effec-tivenessof differentbrowsersandtheircombinationswill bemade.In additionthesystemwill becomparedto moretraditionalformsof browsingandediting.

TheDisplayWall projecthasimplementeda varietyof alter-native input methodsin additionto standardkeyboardandmousenavigation. We are exploring the useof speechrecognitionin-put for controlling thesystem.Theclient architectureof theuserinterfaceswill make the integrationwith thespeechsystemeasy.Speechrecognitionis alreadybeingusedfor acircuit viewer. Cur-rently theplacingof thewindows is doneby thewindow managerrequiringmanualadjustmentwhich is difficult onsucha largedis-play. Therefore,intelligent automaticplacementof windows isplannedfor thefuture.

Currentlyall thebookkeeping(mainly theMapFiles) is doneusingflat text files. We areconsideringtheuseof a databasesys-temto organizethis information.Finally agraphicaluserinterfacefor theMapFileBuilder is underdevelopment.

Our hope is that somedayin the future, soundediting willlook moresimilar to ourprototypesystemthanto whatis currentlycommerciallyavailable.

12. ACKNOWLEDGEMENTS

ThankstoDougTurnbull for workontheClassGramfor FM RadioandJazz. This work wasfundedunderNSF grant9984087andfrom gifts from Intel andArial Foundation.

13. REFERENCES

[1] Kai Li et.al, “Building and using a scalabledisplay wallsystem,” IEEE ComputerGraphicsand Applications, vol.21,no.3, July/August2000.

[2] M. FernstormandC. McNamara,“After directmanipulation- directsonification,” in Proc.Int.Confon Auditory Display,ICAD, 1998.

[3] M. FernstormandL. Bannon, “Multimedia browsing,” inProc.Int.ConfonComputerHumanInteraction,CHI, 1997.

[4] G. Tzanetakisand P. Cook, “3D graphicstools for soundcollections,” in Proc.COSTG6Conferenceon Digital AudioEffects,DAFX, 2000.

[5] C. Chafe,B. Mont-Reynaud,andL. Rush, “Toward an in-telligenteditorof digital audio:Recognitionof musicalcon-structs,” ComputerMusicJournal, vol. 6, no. 1, pp. 30–41,1982.

[6] J.A. Moorer, “The lucasfilmaudiosignalprocessor,” Com-puter Music Journal, vol. 6, no. 3, pp. 30–41,1982, alsoappearsin ICASSP82.

[7] A. Hauptmannand M. Witbrock, “Informedia: News-on-demandmultimedia information acquisition and re-trieval,” in Intelligent Multimedia Information Retrieval,chapter10, pp. 215–240.MIT Press,Cambridge,Mass.,1997,http://www.cs.cmu.edu/afs/cs/user/alex/www/.

[8] B. Shneiderman,Designingthe user interface: strategiesfor effectivehuman-computerinteraction, Addison-Wesley,1987.

[9] G. MarchioniniandB. Shneiderman,“Finding factsversusbrowsingknowledgein hypertext systems,” IEEEComputer,vol. 19,pp.70–80,1988.

[10] G. Tzanetakisand P. Cook, “Audio information retrieval(AIR) tools,” in Proc. Int. Symposiumon Audio InformationRetrieval, ISMIR, 2000.

[11] E. Scheirer, “Tempoandbeatanalysisof acousticmusicalsignals,” J.Acoust.Soc.Am, vol. 103,no. 1, pp. 588,601,Jan1998.

[12] L.T.Jolliffe, Principal ComponentAnalysis, Springer-Verlag,1986.

[13] J.Foote,“An overview of audioinformationretrieval,” ACMMultimediaSystems, vol. 7, pp.2–10,1999.

[14] T. Hermann,P. Meinicke, and H. Ritter, “Principal curvesonification,” in Proc. Int. Confon Auditory Display, ICAD,2000.

ICAD01-5

Documents

MARSYAS3D: A PROTOTYPE AUDIO BROWSER-EDIT OR USING …soundlab.cs.princeton.edu/publications/2001_icad_m3d.pdf · movies, music browsing, computer music and in general any ap-plication