Sounding Objects

Davide Rocchesso, University of Verona, Italy
Roberto Bresin, Royal Institute of Technology, Sweden
Mikael Fernström, University of Limerick, Ireland

Interactive systems, virtual environments, and information display applications need dynamic sound models rather than faithful audio reproductions. This implies three levels of research: auditory perception, physics-based sound modeling, and expressive parametric control. Parallel progress along these three lines leads to effective auditory displays that can complement or substitute visual displays.

When designing a visual widget we attempt to understand the users’ needs. Based on initial requirements, we can sketch the idea on paper and start to test it. We can ask potential users for advice and they might tell us that, for example, “it should be like a rectangular button, with a red border, in the upper left hand corner, with a frame around it.” As testing and new sketches evolve, we can then code the widget and do further tests.

Designing auditory-enhanced interfaces immediately poses a number of problems that aren’t present in the visual case. First, the vocabulary for sound is vague—for example, “when I do this, it goes whoosh,” or “it should sound like opening a door.” What does whoosh or an opening door really sound like? Is one person’s whoosh the same as another person’s whoosh? What can I do, as a designer, if I want to continuously control the sound of a door opening? A sample from a sound effect collection is useless in this case. This article aims to shed some light on how psychologists, computer scientists, acousticians, and engineers can work together and address these and other questions arising in sound design for interactive multimedia systems.

Sounding Object

It’s difficult to generate sounds from scratch with a given perceived identity using signal-based models, such as frequency modulation or additive synthesis.1 However, physically based models offer a viable way to get naturally behaving sounds from computational structures that can easily interact with the environment. The main problem with physical models is their reduced degree of generality—that is, models are usually developed to provide a faithful simulation of a given physical system, typically a musical instrument. Yet, we can build models that retain direct physical meaning and are at the same time configurable in terms of physical properties such as shape, texture, kind of excitation, and so on. We used this approach in the Sounding Object2 (http://www.soundobject.org) project that the European Commission funded to study new auditory interfaces for the Disappearing Computer initiative (http://www.disappearing-computer.net).

Based on our experience with the Sounding Object project, and other recent achievements in sound science and technology, we’ll explain how

❚ the sound model design should rely on perception to find the ecologically relevant auditory phenomena and how psychophysics can help in organizing access to their parameters;

❚ physics-based models can be “cartoonified” to increase both computational efficiency and perceptual sharpness;

❚ the physics-related variables have to be varied in time and organized in patterns to convey expressive information (this is the problem of control); and

❚ we can construct interactive systems around sound and control models.

We’ll illustrate each of these points with one example. You can download the details, software implementations, and sound excerpts of each example from http://www.soundobject.org.

Perception for sound modeling

Humans sense the physical world and form mental images of its objects, events, and processes. We perceive physical quantities according to nonlinear, interrelated scales. For instance, humans perceive both the frequency and the intensity of sine waves according to nonlinear scales (called, respectively, mel and sone), and intensity perception is frequency dependent. The discipline of psychoacoustics has tried, for more than a century, to understand how sound perception works.3 Although the amount of literature on this subject is enormous, and it can help us guide the design of auditory displays, most of it covers perception of sounds that rarely occur in real life, such as pure sine waves or white noise. A significantly smaller number of studies have focused on sounds of musical instruments, and an even smaller body of literature is available for everyday sounds. The latter are particularly important for designing human–computer interfaces, information sonification, and auditory displays.
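
As a concrete illustration of such nonlinear scales, the short sketch below evaluates two standard psychoacoustic approximations: O’Shaughnessy’s formula for the mel pitch scale and Stevens’ rule of thumb relating loudness in sones to level in phons. Both are textbook approximations quoted here only for illustration; they are not part of the Sounding Object models.

import math

def hz_to_mel(f_hz):
    # O'Shaughnessy's approximation of the mel pitch scale
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def phon_to_sone(level_phon):
    # Stevens' rule of thumb: +10 phons roughly doubles perceived loudness
    # (a reasonable approximation above about 40 phons)
    return 2.0 ** ((level_phon - 40.0) / 10.0)

for f in (250, 500, 1000, 2000, 4000):
    print(f"{f:5d} Hz  -> {hz_to_mel(f):7.1f} mel")
for lp in (40, 50, 60, 70, 80):
    print(f"{lp:3d} phon -> {phon_to_sone(lp):5.1f} sone")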

Even fundamental phenomena, such as the auditory perception of size and shape of objects or the material they’re made of, have received attention only recently (see Vicario et al.4 for an annotated bibliography). More complex phenomena, such as bouncing and breaking events or liquids pouring into vessels, have drawn the attention of ecological psychologists and interface designers,5 mainly because of their expressiveness in conveying dynamic information about the environment.


Previous Research

The introduction of sound models in interfaces for effective human–computer interaction dates back to the late 1980s, when William Gaver developed the SonicFinder for Apple’s Macintosh, thus giving an effective demonstration of auditory icons as a tool for information and system status display.1,2 The SonicFinder extended Apple’s file management application Finder using auditory icons with some parametric control. The strength of the SonicFinder was that it reinforced the desktop metaphor, creating an illusion that the system’s components were tangible objects that you could directly manipulate. Apple’s later versions of MacOS (8.5 onward) implemented appearance settings—which control the general look and feel of the desktop metaphor, including sound—based on the SonicFinder.

During the last 10 years, researchers have actively debated issues such as auditory icons, sonification, and sound design while the International Community for Auditory Display (http://www.icad.org) has emerged as an international forum, with its own annual conference. In addition, several other conferences in multimedia and computer–human interaction have hosted an increasing number of contributions about auditory interfaces. Researchers have developed several important auditory-display-based applications that range from computer-assisted surgery to continuous monitoring of complex systems to analysis of massive scientific data.3 However, the relevant mass of experiments in sound and multimedia has revealed a substantial lack of methods for designing meaningful, engaging, and controllable sounds.

Researchers have attempted to fill this gap from two opposite directions. Some researchers have tried to understand and exploit specific sound phenomena. For example, Gaver4 studied and modeled the sound of pouring liquids. Others have constructed general and widely applicable sound information spaces. For instance, Barrass proposed an auditory information space analogous to the color information spaces based on perceptual scales and features.5

The need for sounds that can convey information about the environment yet be expressive and aesthetically interesting led to our proposal of sound spaces constructed on dynamic sound models. These sound spaces build on synthesis and processing models that respond continuously to user or system control signals. We proposed that the architecture for such a sound space would be based on perceptual findings and would use the sound modeling technique that’s most appropriate for a given task. Sound events have both identity and quality aspects,6 so physical models are appropriate for representing a sound’s identity and signal-based models are more appropriate for adjusting a sound’s quality.7 In fact, the nature of a sound’s physical structure and mechanisms (such as a struck bar, a rubbed membrane, and so on) determines a sound source’s identity while the specific instances of physical quantities (for example, metal bars are brighter than wood bars) determine its qualitative attributes.

References
1. W.W. Gaver, “The Sonic Finder: An Interface that Uses Auditory Icons,” Human–Computer Interaction, vol. 4, no. 1, Jan. 1989, pp. 67-94.
2. W.W. Gaver, “Synthesizing Auditory Icons,” Proc. INTERCHI, ACM Press, 1993, pp. 228-235.
3. G. Kramer et al., Sonification Report: Status of the Field and Research Agenda, 1997, http://www.icad.org/websiteV2.0/References/nsf.html.
4. W.W. Gaver, “How Do We Hear in the World? Explorations in Ecological Acoustics,” Ecological Psychology, vol. 5, no. 4, Sept. 1993, pp. 285-313.
5. S. Barrass, “Sculpting a Sound Space with Information Properties,” Organised Sound, vol. 1, no. 2, Aug. 1996, pp. 125-136.
6. S. McAdams, “Recognition of Sound Sources and Events,” Thinking in Sound: The Cognitive Psychology of Human Audition, S. McAdams and E. Bigand, eds., Oxford Univ. Press, 1993, pp. 146-198.
7. M. Bezzi, G. De Poli, and D. Rocchesso, “Sound Authoring Tools for Future Multimedia Systems,” Proc. IEEE Int’l Conf. Multimedia Computing and Systems, vol. 2, IEEE CS Press, 1999, pp. 512-517.


As an example, we report the results of a test that we conducted using a large subject set (197 undergraduate computer science students) to evaluate the effectiveness of liquid sounds as an auditory substitute of a visual progress bar. The advantages for this application are evident, as the user may continuously monitor background activities without being distracted from the foreground work. We used 11 recordings of 0.5- and 1-liter plastic bottles being filled or emptied with water. The sound files’ lengths ranged from 5.4 to 21.1 seconds. The participants used headphones to listen to the sounds. They were instructed to respond using the “0” and “1” keys on their keyboards. We divided the experiment into three sections to detect

❚ the action of filling or emptying,

❚ whether the bottle was half full or empty, and

❚ whether the bottle was almost full.

We randomized the order between sections and between sound files. When we asked the participants to respond if the sound was filling or emptying, 91.8 percent responded correctly for emptying sounds and 76.4 percent for filling sounds. In the section where they responded to when the bottle was half full or empty, responses during filling had a mean of 0.40 (range normalized to 1) with a standard deviation of 0.13. During emptying, responses had a mean of 0.59 with a standard deviation of 0.18. In the section where users responded to whether the bottle was almost full (just about to overflow) or almost empty, the mean value for filling sounds was 0.68 with a standard deviation of 0.18, and for emptying sounds the mean was 0.78 with a standard deviation of 0.18. Based on these results, we envisage successfully using this kind of sound, for example, as an auditory progress bar, because users can distinguish between full or empty and close to completion. From informal studies, we’ve noted that users also can differentiate between the bottle sizes, pouring rate, and viscosity.

Perception is also important at a higher level, when sounds are concatenated and arranged into patterns that can have expressive content. Humans have excellent capabilities of deducing expressive features from simple moving patterns. For instance, the appropriate kinematics applied to dots can provide a sense of effort, or give the impression that one dot is pushing another.6 Similarly, temporal sequences of sounds can convey expressive information if properly controlled.

Cartoon sound models

In information visualization and human–computer visual interfaces, photorealistic rendering is often less effective than nicely designed cartoons. Stylized pictures and animations are key components of complex visual displays, especially when communication relies on metaphors.6 Similarly, auditory displays may benefit from sonic cartoons5—that is, simplified descriptions of sounding phenomena with exaggerated features. We often prefer visual and auditory cartoons over realistic images and sounds7 because they ease our understanding of key phenomena by simplifying the representation and exaggerating the most salient traits. This can lead to a quicker understanding of the intended message and the possibility of detailed control over the expressive content of the picture or sound.

Sound design practices based on the understanding of auditory perception8 might try to use physics-based models for sound generation and signal-based models for sound transformation. In this way it’s easier to impose a given identity on objects and interactions (such as the sound of impact between two ceramic plates). Some quality aspects, such as the apparent size and distance of objects, may be adjusted through time-frequency manipulations9 when they’re not accessible directly from a physical model.

Physics-based models

The sound synthesis and computer graphics communities have increasingly used physical models whenever the goal is a natural dynamic behavior. A side benefit is the possibility of connecting the model control variables directly to sensors and actuators since the physical quantities are directly accessible in the model. We can use several degrees of accuracy in developing a physics-based sound model. Using a finite-element model of interacting objects would make sense if tight synchronization with realistic graphic rendering is required, especially in the cases of fractures of solids or fluid-dynamic phenomena.10 Most often, good sounds can be obtained by simulating the rigid-body dynamics of objects described by piecewise parametric surfaces.11 However, even in this case each impact results in a matrix of ordinary differential equations that must be solved numerically. If the display is mainly auditory, or if the visual object dynamics can be rendered only poorly (for example, in portable low-power devices), we can rely on perception to simplify the models significantly, without losing either their physical interpretability or dynamic sound behavior.

As an example, consider the model of a bouncing object such as a ball. On one extreme, we can try to develop a finite-element model of the air cavity, its enclosure, and the dynamics of impact subject to gravity. On the other hand, if we have a reliable model of nonlinear impact between a striking object and a simple mechanical resonator, we can model the system as a point-like source displaying a series of resonant modes. The source can be subject to gravity, thus reproducing a natural bouncing pattern, and we can tune the modes to those of a sphere, thus increasing the impression of having a bouncing ball.

Of course, this simplification based on lumping distributed systems into point-like objects introduces inevitable losses in quality. For instance, even though we can turn the ball into a cube by moving the modal resonances to the appropriate places, the effect wouldn’t resemble that of a bouncing cube just because the temporal pattern followed by a point-like bouncing object doesn’t match that of a bouncing cube. However, we can introduce some controlled randomness in the bouncing pattern in such a way that the effect becomes perceptually consistent. Again, if the visual display can afford the same simplifications, the overall simulated phenomenon will be perceived as dynamic.12

The immediate benefit of simplified physics-based models is that they can run in real time on low-cost computers, and gestural or graphical interfaces can interactively control the models. Figure 1 shows the graphical Pure Data patch of a bouncing object, where sliders can control the size, elasticity, mass, and shape. (For more information about Pure Data, see http://www.pure-data.org.) In particular, the shape slider allows continuous morphing between sphere and cube via superellipsoids, and it controls the position of resonances and the statistical deviation from the regular temporal pattern of a bouncing ball.
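
To make the cartoon concrete, the following sketch (in Python rather than Pure Data) renders a point-like source with a handful of decaying resonant modes, retriggered at impact times that follow the energy-losing bouncing pattern of a mass under gravity. The modal frequencies, decay times, and restitution coefficient are made-up illustrative values, not those of the patch in Figure 1.

import math

SR = 22050   # sample rate (Hz)
G = 9.81     # gravitational acceleration (m/s^2)

# Made-up modal data for a small "cartoon" resonator:
# (frequency in Hz, decay time constant in s, relative amplitude)
MODES = [(523.0, 0.12, 1.0), (1187.0, 0.08, 0.6), (2301.0, 0.05, 0.3)]

def impact_times(height=1.0, restitution=0.75, n_bounces=12):
    # Impact instants and speeds of a point mass dropped from `height`,
    # losing energy at every bounce through a restitution coefficient.
    v = math.sqrt(2.0 * G * height)        # speed at the first impact
    t = math.sqrt(2.0 * height / G)        # time of the first impact
    times, speeds = [], []
    for _ in range(n_bounces):
        times.append(t)
        speeds.append(v)
        v *= restitution                   # energy loss at the bounce
        t += 2.0 * v / G                   # ballistic flight to the next impact
    return times, speeds

def render(duration=3.0):
    # Sum of exponentially decaying sinusoidal modes, retriggered at each
    # impact with an amplitude proportional to the impact speed.
    out = [0.0] * int(duration * SR)
    for t0, v in zip(*impact_times()):
        start = int(t0 * SR)
        for freq, tau, amp in MODES:
            stop = min(len(out), start + int(5 * tau * SR))
            for i in range(start, stop):
                dt = (i - start) / SR
                out[i] += v * amp * math.exp(-dt / tau) * math.sin(2 * math.pi * freq * dt)
    return out

samples = render()
print(f"rendered {len(samples)} samples, peak = {max(abs(s) for s in samples):.2f}")

Perturbing the impact times with a small random jitter is where the sphere-to-cube morphing described above would hook in.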

Drinking lemonade with a straw

In physical sciences it’s customary to simplify physical phenomena for better understanding of first principles and to eliminate second-order effects. In computer-based communication, when the goal is to improve the clarity and effectiveness of information, the same approach turns out to be useful, especially when it’s accompanied by exaggeration of selected features. This process is called cartoonification.5

Consider the cartoon in Figure 2, displaying a mouse-like character drinking lemonade with a straw. The illustration comes from a children’s book, where children can move a carton flap that changes the level of liquid in the glass.


Figure 1. Pure Data patch for a bouncing object.

Figure 2. Maisy drinking with a straw. (Illustration from “Happy Birthday Maisy,” ©1998 by Lucy Cousins. Reproduced by permission of the publisher Candlewick Press, Inc., Cambridge, Mass., on behalf of Walker Books Ltd., London.)


With the technology that’s now available, it wouldn’t be too difficult to augment the book with a small digital signal processor connected to sensors and to a small actuator, so that the action on the flap would also trigger a sound. Although many toys feature this, the sounds are almost inevitably prerecorded samples that, when played from low-cost actuators, sound irritating. Even in experimental electronically augmented books with continuous sensors, due to the intrinsic limitations of sampled sounds, control is usually limited to volume or playback speed.13 The interaction would be much more effective if the sound of a glass being emptied is a side effect of a real-time simulation that responds continuously to the actions. Instead of trying to solve complex fluid-dynamic problems, we use a high-level analysis and synthesis of the physical phenomena as follows (a code sketch of the whole pipeline appears after the list):

1. take one micro event that represents a small collapsing bubble;

2. arrange a statistical distribution of micro events;

3. filter the sound of step 2 through a mild resonant filter, representing the first resonance of the air cavity separating the liquid surface from the glass edge;

4. filter the sound of step 3 with a sharp comb filter (harmonic series of resonances obtained by a feedback delay line), representing the filtering effect of the straw; and

5. add abrupt noise bursts to mark the initial and final transients when the straw rapidly passes from being filled with air to being filled with liquid, and vice versa.
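
The following Python sketch is a minimal rendition of this five-step pipeline. The micro event, the event statistics, and the filter settings are all illustrative guesses rather than the parameter values used in the project.

import math, random

SR = 22050
random.seed(1)

def bubble(freq, dur=0.02, amp=0.3):
    # Step 1: one micro event -- a tiny decaying sine standing in for a
    # collapsing bubble (an illustrative choice, not the project's exact event).
    n = int(dur * SR)
    return [amp * math.exp(-6.0 * i / n) * math.sin(2 * math.pi * freq * i / SR)
            for i in range(n)]

def bubble_cloud(duration=2.0, rate=60.0):
    # Step 2: scatter micro events in time with random (Poisson-like) onsets.
    out = [0.0] * int(duration * SR)
    t = 0.0
    while True:
        t += random.expovariate(rate)
        if t >= duration:
            break
        start = int(t * SR)
        for i, s in enumerate(bubble(random.uniform(800.0, 3000.0))):
            if start + i < len(out):
                out[start + i] += s
    return out

def resonator(x, freq, r=0.97):
    # Step 3: mild two-pole resonance for the air cavity above the liquid.
    w = 2.0 * math.pi * freq / SR
    b1, b2 = 2.0 * r * math.cos(w), -r * r
    y1 = y2 = 0.0
    y = []
    for s in x:
        y0 = s + b1 * y1 + b2 * y2
        y.append(y0)
        y1, y2 = y0, y1
    return y

def comb(x, freq, g=0.9):
    # Step 4: feedback delay line tuned to the straw's harmonic series.
    d = max(1, int(SR / freq))
    buf = [0.0] * d
    y = []
    for i, s in enumerate(x):
        out = s + g * buf[i % d]
        buf[i % d] = out
        y.append(out)
    return y

def burst(dur=0.05, amp=0.5):
    # Step 5: abrupt noise burst marking the initial and final transients.
    n = int(dur * SR)
    return [amp * random.uniform(-1.0, 1.0) * (1.0 - i / n) for i in range(n)]

sig = comb(resonator(bubble_cloud(), freq=400.0), freq=180.0)
sig = burst() + sig + burst()
peak = max(abs(s) for s in sig)
sig = [s / peak for s in sig]            # normalize before playback or file export
print(f"{len(sig)} samples, normalized peak 1.0")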

The example of the straw is extreme in the context of cartoon sound models, as in reality this physical phenomenon is usually almost silent. However, the sound model serves the purpose of signaling an activity and monitoring its progress. Moreover, because we built the model based on the actual physical process, the model integrates perfectly with the visual image and follows the user gestures seamlessly, thus resulting in far less irritating sounds.

On the importance of control

Sound synthesis techniques have achieved remarkable results in reproducing musical and everyday sounds. Unfortunately, most of these techniques focus only on the perfect synthesis of isolated sounds, thus neglecting the fact that most of the expressive content of sound messages comes from the appropriate articulation of sound event sequences. Depending on the sound synthesis technique, we must design a system to generate control functions for the synthesis engine in such a way that we can arrange single events in naturally sounding sequences.14

Sound control can be more straightforward if we generate sounds with physics-based techniques that give access to control parameters directly connected to sound source characteristics. In this way, the design of the control layer can focus on the physical gestures rather than relying on complex mapping strategies. In particular, music performance studies have analyzed musical gestures, and we can use this knowledge to control cartoon models of everyday sounds.

Control models

Recently, researchers have studied the relationship between music performance and body motion. Musicians use their body in a variety of ways to produce sound. Pianists use shoulders, arms, hands, and feet; trumpet players use their lungs and lips; and singers use their vocal cords, breathing system, phonatory system, and expressive body postures to render their interpretation. When playing an interleaved accent in drumming, percussionists prepare for the accented stroke by raising the drumstick up higher, thus arriving at the striking point with larger velocity.

The expressiveness of sound rendering is largely conveyed by subtle but appropriate variations of the control parameters that result from a complex mixture of intentional acts and constraints of the human body’s dynamics. Thirty years of research on music performance at the Royal Institute of Technology (KTH) in Stockholm has resulted in about 30 so-called performance rules. These rules allow reproduction and simulation of different aspects of the expressive rendering of a music score. Researchers have demonstrated that they can combine rules and set them up in such a way that generates emotionally different renderings of the same piece of music. The results from experiments with expressive rendering showed that in music performance, emotional coloring corresponds to an enhanced musical structure. We can say the same thing about hyper- and hypo-articulation in speech—the quality and quantity of vowels and consonants vary with the speaker’s emotional state or the intended emotional communication. Yet, the phrase structure and meaning of the speech remain unchanged. In particular, we can render emotions in music and speech by controlling only a few acoustic cues.15 For example, email users do this visually through emoticons such as ;-). Therefore, we can produce cartoon sounds by simplifying physics-based models and controlling their parameters.

Walking and running

Music is essentially a temporal organization of sound events along short and long time scales. Accordingly, microlevel and macrolevel rules span different time lengths. Examples of the first class of rules include the Score Legato Articulation rule, which realizes the acoustical overlap between adjacent notes marked legato in the score, and the Score Staccato Articulation rule, which renders notes marked staccato in the score. Final Retard is a macrolevel rule that realizes the final ritardando typical in Baroque music.16 Friberg and Sundberg demonstrated how their model of final ritardando was derived from measurements of stopping runners. Recently, researchers discovered analogies in timing between walking and legato and running and staccato.17 These findings show the interrelationship between human locomotion and music performance in terms of tempo control and timing.

Friberg et al.18 recently studied the association of music with motion. They transferred measurements of the ground reaction force by the foot during different gaits to sound by using the vertical force curve as sound-level envelopes for tones played at different tempos. Results from listening tests were consistent and indicated that each tone (corresponding to a particular gait) could clearly be categorized in terms of motion. These analogies between locomotion and music performance open new possibilities for designing control models for artificial walking sound patterns and for sound control models based on locomotion.
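
A minimal sketch of this motion-to-sound transfer is given below, with a synthetic two-bump curve standing in for a measured vertical ground-reaction-force profile; the curve shape, tone frequency, and tempo are illustrative values, not Friberg et al.’s data.

import math

SR = 22050

def grf_envelope(n):
    # Crude stand-in for a vertical ground-reaction-force curve during one
    # walking step: two overlapping bumps (heel strike and push-off).
    env = []
    for i in range(n):
        t = i / n
        env.append(math.exp(-((t - 0.3) / 0.12) ** 2) +
                   math.exp(-((t - 0.7) / 0.12) ** 2))
    return env

def gait_tones(freq=440.0, step_dur=0.6, steps=4):
    # Use the force curve as the sound-level envelope of a tone repeated
    # at the gait tempo, in the spirit of the listening tests described above.
    n = int(step_dur * SR)
    env = grf_envelope(n)
    out = []
    for _ in range(steps):
        out.extend(env[i] * math.sin(2 * math.pi * freq * i / SR) for i in range(n))
    return out

print(f"{len(gait_tones())} samples of 'walking' tones")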

We used the control model for humanized walking to control the timing of the sound of one step of a person walking on gravel. We used the Score Legato Articulation and Phrase Arch rules to control the timing of sound samples. We’ve observed that in walking there’s an overlap time between any two adjacent footsteps (see Figure 3) and that the tendency in overlap time is the same as observed between adjacent notes in piano playing: the overlap time increases with the time interval between the two events. This justifies using the Score Legato Articulation rule for walking. The Phrase Arch rule used in music performance renders accelerandi and rallentandi, which are a temporary increase or decrease of the beat rate. This rule is modeled according to velocity changes in hand movements between two fixed points on a plane. We thought that the Phrase Arch rule would help us control walking tempo changes on a larger time scale.

Figure 3. Waveform and spectrogram of walking and running sounds on gravel. (a) Three adjacent walking gaits overlap in time. The red rectangle highlights the time interval between the onset time of two adjacent steps. (b) Vertical white strips (micropauses) correspond to flight time between adjacent running gaits. The red rectangle highlights the time interval between the onset and offset times of one step.

Human running resembles staccato articulation in piano performance, as the flight time in running (visible as a vertical white strip in the spectrograms in Figure 3) can play the same role as the key-detach time in piano playing. We implemented the control model for stopping runners by applying the Final Retard rule to the tempo changes of sequences of running steps on gravel.

We implemented a model for humanized walking and one for stopping runners as Pure Data patches. Both patches allow controlling the tempo and timing of sequences of simple sound events. We conducted a listening test comparing step sound sequences without control to sequences rendered by the control models presented here. The results showed that subjects preferred the rendered sequences and labeled them as more natural, and they correctly classified different types of motion produced by the models.4 Recently, we applied these control models to physics-based sound models (see the “Web Extras” sidebar for a description of these sound examples available at http://computer.org/multimedia/mu2003/u2toc.htm).
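
The sketch below captures the two timing ideas in a few lines of Python: step onsets whose spacing stretches toward the end, in the spirit of the Final Retard rule, and overlap times that grow with the inter-onset interval, in the spirit of the Score Legato Articulation rule. The parametric forms and constants are our own simplifications, not the KTH rule definitions.

def ritardando_onsets(n_steps=10, base_ioi=0.5, final_factor=2.5, q=2.0):
    # Onset times (in seconds) for a stopping runner: the inter-onset interval
    # stretches smoothly from base_ioi to final_factor * base_ioi along a
    # power-law tempo curve (constants are illustrative, not the rule's).
    onsets, t = [], 0.0
    for k in range(n_steps):
        x = k / max(1, n_steps - 1)                 # normalized position, 0..1
        tempo = (1.0 + (final_factor ** -q - 1.0) * x) ** (1.0 / q)
        onsets.append(t)
        t += base_ioi / tempo                       # slower tempo, longer interval
    return onsets

def legato_overlaps(onsets, k=0.2):
    # Overlap between adjacent steps grows with the inter-onset interval, as
    # observed for legato notes in piano playing (k is a made-up proportion).
    return [k * (b - a) for a, b in zip(onsets, onsets[1:])]

onsets = ritardando_onsets()
overlaps = legato_overlaps(onsets)
for i, (t, ov) in enumerate(zip(onsets, overlaps)):
    dur = (onsets[i + 1] - t) + ov                  # each step rings into the next
    print(f"step {i}: onset {t:5.2f} s, sample duration {dur:4.2f} s")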

The proposed rule-based approach for sound control is only a step toward the design of more general control models that respond to physical gestures. In our research consortium, we’re applying the results of studies on the expressive gestures of percussionists to impact-sound models19 and the observations on the expressive character of friction phenomena to friction sound models. Impacts and frictions deserve special attention because they’re ubiquitous in everyday soundscapes and likely to play a central role in sound-based human–computer interfaces.

New interfaces with sound

Because sound can enhance interaction, we need to explore ways of creating and testing suitable auditory metaphors. One of the problems with sounds used in auditory interfaces today is that they always sound the same, so we need to have parametric control. Then, small events can make small sounds; big events make big sounds; the effort of an action can be heard; and so on. As Truax20 pointed out, before the development of sound equipment in the 20th century, nobody had ever heard exactly the same sound twice. He also noted that fixed waveforms—as used in simple synthesis algorithms—sound unnatural, lifeless, and annoying. Hence, with parametric control and real-time synthesis of sounds, we can get rich continuous auditory representation in direct manipulation interfaces. For instance, a simple audiovisual animation in the spirit of Michotte’s famous experiments6 (see the geometric forms animation in the “Web Extras” sidebar) shows how sound can elicit a sense of effort and enhance the perceived causality of two moving objects.


Web Extras

Visit IEEE MultiMedia’s Web site at http://www.computer.org/multimedia/mu2003/u2toc.htm to view the following examples that we designed to express the urgency of dynamic sound models in interactive systems:

❚ Physical animation (AVI video) of simple geometric forms. The addition of consistent friction sounds (generated from the same animation) largely contributes to construct a mental representation of the scene and to elicit a sense of effort as experienced when rubbing one object onto another.

❚ Sound example (Ogg Vorbis file) of gradual morphing between a bouncing sphere and a bouncing cube. It uses the cartoon model described in the “Cartoon sound models” section.

❚ Sound example (Ogg Vorbis file) of Maisy drinking several kinds of beverages at different rates and from different glasses. We based this on the cartoon model described in the “Cartoon sound models” section.

❚ Sound example (Ogg Vorbis file) of an animated walking sequence constructed from two recorded step samples (gravel floor) and one that uses a dynamic cartoon model of crumpling (resembles steps on a crispy snow floor). The rules described in the “On the importance of control” section govern this animation.

Players for the Ogg Vorbis files are available at http://www.vorbis.com for Unix, Windows, Macintosh, BeOS, and PS2.


Design

Barrass21 proposed a rigorous approach to sound design for multimedia applications through his Timbre-Brightness-Pitch Information Sound Space (TBP ISS). Similar to the Hue-Saturation-Lightness model used for color selection, sounds are arranged in a cylinder, where the radial and longitudinal dimensions are bound to brightness and pitch, respectively. The timbral (or identity) dimension of the model is restricted to a small collection of musical sound samples uniformly distributed along a circle derived from experiments in timbre categorization. A pitfall of this model is that it’s not clear how to enrich the palette of timbres, especially for everyday sounds. Recent investigations found that perceived sounds tend to cluster based on shared physical properties.22 This proves beneficial for structuring sound information spaces because there are few sound-producing fundamental mechanisms as compared to all possible sounding objects. For instance, a large variety of sounds (bouncing, breaking, scraping, and so on) can be based on the same basic impact model.12 Bezzi, De Poli, and Rocchesso8 proposed the following three-layer sound design architecture:

❚ Identity: A set of physics-based blocks connected in prescribed ways to give rise to a large variety of sound generators.

❚ Quality: A set of signal-processing devices (digital audio effects19) specifically designed to modify the sound quality.

❚ Spatial organization: Algorithms for reverberation, spatialization, and changes in sound-source position and size.9

Such a layered architecture could help design a future sound authoring tool, with the possible addition of tools for spatio-temporal texturing and expressive parametric control.
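
One way to read the three layers as software is sketched below: an identity block generates a raw signal, a quality block reshapes it, and a spatial block places it in a stereo field. The block names and the trivial processing are invented for illustration and are not part of the architecture proposed by Bezzi, De Poli, and Rocchesso.

import math
from typing import Callable, List, Tuple

def impact_generator(n: int = 1000) -> List[float]:
    # Identity layer: a physics-flavored generator (here just a decaying
    # oscillation standing in for an impact model).
    return [math.exp(-8.0 * i / n) * math.sin(0.3 * i) for i in range(n)]

def brightness(amount: float) -> Callable[[List[float]], List[float]]:
    # Quality layer: a signal-based tweak (a one-pole low-pass whose
    # coefficient acts as a crude "brightness" control).
    def run(x: List[float]) -> List[float]:
        y, state = [], 0.0
        for s in x:
            state += amount * (s - state)
            y.append(state)
        return y
    return run

def stereo_position(pan: float) -> Callable[[List[float]], List[Tuple[float, float]]]:
    # Spatial organization layer: constant-power panning into a stereo pair.
    left, right = math.cos(pan * math.pi / 2), math.sin(pan * math.pi / 2)
    def run(x: List[float]) -> List[Tuple[float, float]]:
        return [(left * s, right * s) for s in x]
    return run

# Chain the layers: identity -> quality -> spatial organization.
dry = impact_generator()
stereo = stereo_position(0.75)(brightness(0.2)(dry))
print(f"{len(stereo)} stereo frames, first frame = {stereo[0]}")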

Organization and access

The complexity of the sonic palette calls for new methods for browsing, clustering, and visualizing dynamic sound models. The Sonic Browser (see Figure 4) uses the human hearing system’s streaming capabilities to speed up the browsing in large databases. Brazil et al.23 initially developed it as a tool that allowed interactive visualization and direct sonification. In the starfield display in Figure 4, each visual object represents a sound. Users can arbitrarily map the location, color, size, and geometry of each visual object to properties of the represented objects. They can hear all sounds covered by an aura simultaneously, spatialized in a stereo space around the cursor (the aura’s center). With the Sonic Browser, users can find target sounds up to 28 percent faster, using multiple stream audio, than with single stream audio.
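
The aura mechanism can be pictured with a small sketch like the one below, which decides which sounds play, and with what stereo placement, for a given cursor position. The equal-power panning and edge fade are our own guesses at a plausible mapping, not the Sonic Browser’s actual code.

import math

def aura_mix(cursor, objects, radius):
    # Decide which sounds play, and with what stereo placement, for a given
    # cursor position: a gain that fades toward the aura's edge and an
    # equal-power pan taken from the object's horizontal offset.
    cx, cy = cursor
    active = []
    for name, (x, y) in objects.items():
        d = math.hypot(x - cx, y - cy)
        if d > radius:
            continue                                  # outside the aura: silent
        gain = 1.0 - d / radius                       # fade out toward the edge
        pan = min(1.0, max(0.0, 0.5 + (x - cx) / (2 * radius)))
        left = gain * math.cos(pan * math.pi / 2)
        right = gain * math.sin(pan * math.pi / 2)
        active.append((name, round(left, 2), round(right, 2)))
    return active

sounds = {"drip.wav": (0.20, 0.30), "thud.wav": (0.25, 0.35), "crash.wav": (0.80, 0.90)}
print(aura_mix(cursor=(0.22, 0.32), objects=sounds, radius=0.10))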

In the context of our current research, we use the Sonic Browser for two different purposes. First, it lets us validate our sound models, as we can ask users to sort sounds by moving visual representations according to auditory perceptual dimensions on screen. If the models align with real sounds, we consider the models valid. Second, the Sonic Browser helps users compose auditory interfaces by letting them interactively access large sets of sounds—that is, to pick, group, and choose what sounds might work together in an auditory interface.

Figure 4. Sonic Browser starfield display of a sound file directory. Four sound files are concurrently played and spatialized because they’re located under the circular aura (cursor).

Manipulation and control

Real-time dynamic sound models with parametric control, while powerful tools in the hands of the interface designer, create important problems if direct manipulation is the principal control strategy. Namely, traditional interface devices (mouse, keyboards, joysticks, and so on) can only exploit a fraction of the richness that emerges from directly manipulating sounding objects. Therefore, we need to explore alternative devices to increase the naturalness and effectiveness of sound manipulation.

The Vodhran

One of the sound manipulation interfaces that we designed in the Sounding Object project is based on a traditional Irish percussion instrument called the bodhran. It’s a frame drum played with a double-sided wood drumstick in hand A while hand B, on the other side of the drum, damps the drumhead to emphasize different modes. This simple instrument allows a wide range of expression because of the richness in human gesture. Either the finger or palm of hand B can perform the damping action, or increase the tension of the drumhead, to change the instrument’s pitch. Users can beat the drumstick on different locations of the drumhead, thus generating different resonance modes. The drumstick’s macrotemporal behavior is normally expressed in varying combinations of duplets or triplets, with different accentuation. We tuned our impact model to reproduce the timbral characteristics of the bodhran and give access to all its controlling parameters such as resonance frequency, damping, mass of the drumstick, and impact velocity.

To implement a virtual bodhran, the Vodhran, we used three devices that differ in control and interaction possibilities—a drumpad, a radio-based controller, and a magnetic tracker system (see Figure 5). Each device connects to a computer running Pure Data with the impact model used in Figure 1.

Drumpad controller. For the first controlling device, we tested a Clavia ddrum4 drumpad (http://www.clavia.se/ddrum/), which reproduces sampled sounds. We used it as a controller to feed striking velocity and damp values into the physical model. The ddrum4 is a nice interface to play the model because of its tactile feedback and the lack of cables for the drumsticks.

Radio-based controller. We used Max Mathews’ Radio Baton24 to enhance the Vodhran’s control capabilities. The Radio Baton is a control device comprised of a receiving unit and an antenna that detects the 3D position of one or two sticks in the space over it. Each of the sticks sends a radio signal. For the Vodhran, we converted the two sticks into two radio transmitters at each end of a bodhran drumstick, and played the antenna with the drumstick as a normal bodhran. The drumstick’s position relative to the antenna controlled the physical model’s impact position, thus allowing real-time control of the instrument’s timbral characteristics. Professional player Sandra Joyce played this version of the bodhran in a live concert. She found that it was easy to use and that it had new expressive possibilities compared to the traditional acoustical instrument.

Tracker-based controller. With the Polhemus Fastrak, we can attach the sensors to users’ hands or objects handled by them, so that they can directly manipulate tangible objects of any kind with the bodhran gestural metaphor. During several design sessions, leading up to the public performance in June 2002, we evaluated numerous sensor configurations. With a virtual instrument, we could make the instrument bodycentric or geocentric. A real bodhran is bodycentric because of the way it’s held. The configurations were quite different. We felt a bodycentric reference was easier to play, although the geocentric configuration yielded a wider range of expression. In the Vodhran’s final public-performance configuration, we attached one Polhemus sensor to a bodhran drumstick (held in the right hand), and the player held the second sensor in her left hand. We chose the geocentric configuration and placed a visual reference point on the floor in front of the player. We used normal bodhran playing gestures with the drumstick in the player’s right hand to excite the sound model. The distance between the hands controlled the virtual drumhead’s tension, and the angle of the left hand controlled the damping. Virtual impacts from the player’s hand gestures were detected in a vertical plane extending in front of the player. We calibrated this vertical plane so that if the player made a virtual impact gesture with a fully extended arm or with the hand close to the player’s body, the sound model’s membrane was excited near its rim. If the player made a virtual impact gesture with the arm half-extended, the sound model was excited at the center of the membrane.

Figure 5. Three controllers for the Vodhran: (a) Clavia ddrum4, (b) Max Mathews’ Radio Baton, and (c) Polhemus Fastrak.
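
The gesture-to-parameter mapping just described can be sketched as a small function, shown below. The coordinate conventions, ranges, and scaling constants are illustrative choices of ours, not the calibration used in the performance.

import math

def vodhran_mapping(right_hand, left_hand, left_hand_angle_deg):
    # Map two tracker sensors to impact-model parameters, following the
    # mapping described in the text; coordinate conventions, ranges, and
    # scaling constants are illustrative, not the performance calibration.
    dx, dy, dz = (r - l for r, l in zip(right_hand, left_hand))
    hand_distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Distance between the hands -> virtual drumhead tension.
    tension = min(1.0, hand_distance / 1.2)           # 1.2 m ~ full arm span

    # Angle of the left hand -> damping.
    damping = min(1.0, abs(left_hand_angle_deg) / 90.0)

    # Arm extension of the drumstick hand -> strike position: near the rim
    # when fully extended or close to the body, at the center when half-extended.
    extension = min(1.0, abs(right_hand[1]) / 0.7)    # 0.7 m ~ full reach
    strike_position = abs(2.0 * extension - 1.0)      # 0 = center, 1 = rim

    return {"tension": tension, "damping": damping, "strike_position": strike_position}

print(vodhran_mapping(right_hand=(0.10, 0.35, 1.10),
                      left_hand=(-0.30, 0.10, 1.00),
                      left_hand_angle_deg=30.0))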

Conclusion

An aesthetic mismatch exists between the rich, complex, and informative soundscapes in which mammals have evolved and the poor and annoying sounds of contemporary life in today’s information society. As computers with multimedia capabilities are becoming ubiquitous and embedded in everyday objects, it’s time to consider how we should design auditory displays to improve our quality of life.

Physics-based sound models might well provide the basic blocks for sound generation, as they exhibit natural and realistically varying dynamics. Much work remains to devise efficient and effective models for generating and controlling relevant sonic phenomena. Computer scientists and acousticians have to collaborate with experimental psychologists to understand what phenomena are relevant and to evaluate the models’ effectiveness. Our work on Sounding Objects initiated such a long-term research agenda and provided an initial core of usable models and techniques.

While continuing our basic research efforts, we are working to exploit the Sounding Objects in several applications where expressive sound communication may play a key role, especially in those contexts where visual displays are problematic.

Acknowledgments

The European Commission’s Future and Emergent Technologies collaborative R&D program under contract IST-2000-25287 supported this work. We’re grateful to all researchers involved in the Sounding Object project. For the scope of this article, we’d like to acknowledge the work of Matthias Rath, Federico Fontana, and Federico Avanzini for sound modeling; Eoin Brazil for the Sonic Browser; and Mark Marshall and Sofia Dahl for the Vodhran. We thank Clavia AB for lending us a ddrum4 system and Max Mathews for providing the Radio Baton.

References
1. C. Roads, The Computer Music Tutorial, MIT Press, 1996.
2. D. Rocchesso et al., “The Sounding Object,” IEEE Computer Graphics and Applications, CD-ROM supplement, vol. 22, no. 4, July/Aug. 2002.
3. W.M. Hartmann, Signals, Sound, and Sensation, Springer-Verlag, 1998.
4. G.B. Vicario et al., Phenomenology of Sounding Objects, Sounding Object project consortium report, 2001, http://www.soundobject.org/papers/deliv4.pdf.
5. W.W. Gaver, “How Do We Hear in the World? Explorations in Ecological Acoustics,” Ecological Psychology, vol. 5, no. 4, Sept. 1993, pp. 285-313.
6. C. Ware, Information Visualization: Perception for Design, Morgan-Kaufmann, 2000.
7. P.J. Stappers, W. Gaver, and K. Overbeeke, “Beyond the Limits of Real-Time Realism—Moving from Stimulation Correspondence to Information Correspondence,” Psychological Issues in the Design and Use of Virtual and Adaptive Environments, L. Hettinger and M. Haas, eds., Lawrence Erlbaum, 2003.
8. M. Bezzi, G. De Poli, and D. Rocchesso, “Sound Authoring Tools for Future Multimedia Systems,” Proc. IEEE Int’l Conf. Multimedia Computing and Systems, vol. 2, IEEE CS Press, 1999, pp. 512-517.
9. U. Zölzer, ed., Digital Audio Effects, John Wiley and Sons, 2002.
10. J.F. O’Brien, P.R. Cook, and G. Essl, “Synthesizing Sounds from Physically Based Motion,” Proc. Siggraph 2001, ACM Press, 2001, pp. 529-536.
11. K. van den Doel, P.G. Kry, and D.K. Pai, “Foley Automatic: Physically-Based Sound Effects for Interactive Simulation and Animation,” Proc. Siggraph 2001, ACM Press, 2001, pp. 537-544.
12. D. Rocchesso et al., Models and Algorithms for Sounding Objects, Sounding Object project consortium, 2002, http://www.soundobject.org/papers/deliv6.pdf.
13. M. Back et al., “Listen Reader: An Electronically Augmented Paper-Based Book,” Proc. CHI 2001, ACM Press, 2001, pp. 23-29.
14. R. Dannenberg and I. Derenyi, “Combining Instrument and Performance Models for High-Quality Music Synthesis,” J. New Music Research, vol. 27, no. 3, Sept. 1998, pp. 211-238.
15. R. Bresin and A. Friberg, “Emotional Coloring of Computer Controlled Music Performance,” Computer Music J., vol. 24, no. 4, Winter 2000, pp. 44-62.
16. A. Friberg and J. Sundberg, “Does Music Performance Allude to Locomotion? A Model of Final Ritardandi Derived from Measurements of Stopping Runners,” J. Acoustical Soc. Amer., vol. 105, no. 3, March 1999, pp. 1469-1484.
17. R. Bresin, A. Friberg, and S. Dahl, “Toward a New Model for Sound Control,” Proc. COST-G6 Conf. Digital Audio Effects (DAFx01), Univ. of Limerick, 2001, pp. 45-49, http://www.csis.ul.ie/dafx01.
18. A. Friberg, J. Sundberg, and L. Frydén, “Music from Motion: Sound Level Envelopes of Tones Expressing Human Locomotion,” J. New Music Research, vol. 29, no. 3, Sept. 2000, pp. 199-210.
19. R. Bresin et al., Models and Algorithms for Control of Sounding Objects, Sounding Object project consortium report, 2002, http://www.soundobject.org/papers/deliv8.pdf.
20. B. Truax, Acoustic Communication, Ablex Publishing, 1984.
21. S. Barrass, “Sculpting a Sound Space with Information Properties,” Organised Sound, vol. 1, no. 2, Aug. 1996, pp. 125-136.
22. J.M. Hajda et al., “Methodological Issues in Timbre Research,” Perception and Cognition of Music, J. Deliege and J. Sloboda, eds., Psychology Press, 1997, pp. 253-306.
23. E. Brazil et al., “Enhancing Sonic Browsing Using Audio Information Retrieval,” Proc. Int’l Conf. Auditory Display, ICAD, 2002, pp. 132-135.
24. J. Paradiso, “Electronic Music Interfaces,” IEEE Spectrum, vol. 34, no. 12, 1997, pp. 18-30.


Davide Rocchesso is an associate professor in the Computer Science Department at the University of Verona, Italy. His research interests include sound processing, physical modeling, and human–computer interaction. Rocchesso received a PhD in computer engineering from the University of Padova, Italy.

Roberto Bresin is a research assistant at the Royal Institute of Technology (KTH), Stockholm. His main research interests are in automatic music performance, affective computing, sound control, visual feedback, music perception, and music education. Bresin received an MSc in electrical engineering from the University of Padova, Italy, and a PhD in music acoustics from KTH.

Mikael Fernström is director of the master’s program in interactive media at the University of Limerick, Ireland. His main research interests include interaction design, sound, music, and electronics. Fernström studied electronic engineering and telecommunications at the Kattegatt technical college in Halmstad, Sweden, and has an MSc in human–computer interaction from the University of Limerick.

Readers may contact Davide Rocchesso at the Computer Science Dept., Univ. of Verona, Strada Le Grazie 15, 37134 Verona, Italy; Davide.Rocchesso@univr.it.
