
Towards the Design of Virtual Sound Sculptures: A Tool for

Multiple Sources 3D Spatialization

Sonic Arts Research Centre - Queen’s University Belfast

16 September 2013

1 Introduction

This report presents a system for the spatialization of multiple sound sources in 3D. The tool offered here makes it possible to simulate the positioning and displacement of multiple sound sources in 3D Ambisonics, from 1st to 5th order, and to convert them into binaural format. After exposing the background and motivations of this work, I will detail the use of this mixed spatialization technique and the architecture of the system. I will then give a brief description of the audio examples that were produced while experimenting with the system, and finish with an overview of the future work that will contribute to its improvement.

2 Background

The last decades have seen the emergence of a variety of efficient spatialization techniques, such as Vector Based Amplitude Panning, Ambisonics, Wave Field Synthesis, or binaural synthesis. Composers, sound artists and sound designers of all orientations are making increasing use of spatial audio, be it as a means to enhance some properties of auditory scenes, to develop a fully space-oriented language, or to create auditory scenes in two or three dimensions. In this regard, the question of access to facilities offering multichannel and spatial audio systems is problematic. The reception of the auditory scene by an audience in a physical space is also problematic, as the quality of the restitution depends not only on the system being used, but also on the location of the listeners in the space. During this master’s, I had the opportunity to focus on the auditory perception of shapes [1, 2] and distance assessment [3, 4], which fed an interest in the design of virtual sound sculptures in real or simulated spaces. More specifically, I was particularly interested in the effects and advantages of binaural rendering and the efficiency of Ambisonics for 3D rendering. I hence wished to develop a tool serving the purpose of an installation for the simulation of three-dimensional sound structures in space. This tool had to allow for work independent of spatial audio facilities, and provide a satisfactory rendering without any restriction for a listener, regardless of where he or she would be situated in the space.

3 Motivations

The initial aim of my project was to produce a sound installation that dealt with the generation of sound volumes 1 of various shapes and dimensions in a naturally reverberated environment.

However, after initial explorations it became evident that I did not have an adequate set of tools for the realisation of such a piece. The project then shifted to the creation of a toolkit that would allow me to realise not only the installation I originally intended, but that could also be adapted to multiple scenarios. Since designing a generalized system making all possible interactions easy is a complex and time-consuming task, I limited the components of this first iteration to dealing specifically with the issues raised by my initial installation idea. I must stress that although some examples of sound materials will be given in section 6, what is being presented here is not a finished installation but a spatialisation toolkit.

The following subsections present a detailed overview of this ideal installation and will help clarify some of the design criteria.

1. The terms sound shapes and sound volumes apply respectively to two-dimensional and three-dimensional sound structures.


3.1 Installation overview

3.1.1 Explore the Source and Space Dialectics

This installation aims to explore the dialectics and frontiers between space and sources in spatial auditory perception. Indeed, those two categories interact in the perception of sound morphologies, and cannot be considered in isolation. Such an installation aims to produce volumes of various densities and spectral characteristics, spatialized to provoke either a subtle and diffuse sense of space or to yield discrete sound entities. Moreover, it aims to explore the continuum, the tenuousness of the frontier, between those two types of auditory representations. This could be achieved by the generation of ambiguous morphologies belonging potentially to both fields of space and source, but also by playing with the cognitive threshold that separates these two perceptive categories. Another way of achieving such a result would be to modify the relative distance and location of a stream, in order to put a sound entity into two different perspectives.

3.1.2 Induce Different Listening Modes

The listener can then experience a succession of sound entities, but also adopt various listening modes, succeeding each other through gradually overlapping sequences. In this continuous succession, attention can fluctuate between the two categories of source and space, and the listener can either experience a sense of immersion, or a sense of distance towards the auditory scene. In the temporal domain, the listener can experience sonic events on the macroscopic and microscopic scales, through continuity and discontinuity. In the spatial domain, discrete sound structures, where the senses of source and space coexist, can evolve towards or contrast with hazy mists of sound particles through which those two categories cannot be distinguished anymore. Such an organization of sound streams can incite the listener to consider one single sound entity according to micro- or macroscopic listening, from both spatial and temporal standpoints, and therefore modify its very perception.

3.1.3 Building a Virtual or a Physical Space: Use of Ambisonics and Binaural Techniques

Two spatialization techniques can be considered for the audio rendering. Using Ambisonics, sounds are projected in a physical space with a set of speakers, while the combination of Ambisonics and binaural techniques yields a virtual space. A change of rendering space can contribute to recontextualizing the auditory material and to comparing changes in cognitive and perceptive effects. It can also underline the importance of the relationship between sound and space in the construction of a sound environment, from both virtual and physical points of view.

A physical projection space for the sounds would emphasize the long-term dimension of the installation and incite the listener to adopt a flexible mode of listening.

3.2 Design of a set of tools

The installation project informed the design of the set of tools, which comprises three main components.

3.2.1 Components Overview

1. The production and control of a sound stream constituted of a number of point sources. The density of this stream should be able to vary from a small amount of grains, translated into discrete sources from a perceptual point of view, to a large amount, translated into a continuous texture. In this regard, granular synthesis seemed relevant.

2. The spatialization of the stream as a coherent whole, allowing for direct control of the positioning and displacement of each point source. A flocking algorithm, fed by behavioural parameters, appeared as a convenient means to generate three-dimensional patterns of point sources. Such a tool was all the more appealing as it yielded a large scope of different morphologies whose characteristics would evolve organically and gradually in time and in space. Last, it represented a great computational advantage: the positioning of each source resulted from those patterns and was returned downstream by the algorithm, sparing me the computation of the positioning of individual sources.

3. A flexible multichannel spatial audio rendering programme allowing for the conception of auditory scenes independently of the use of spatial audio facilities. In this regard, I chose to associate Ambisonics, an efficient spatial audio technique, with binaural rendering, to get a good approximation of a broadcast in a real space with simple headphones.

Last, an important task was to connect these three components and to limit the CPU consumption, as both the audio rendering and the control of the spatialization can require substantial computation.


3D Ambisonics
First Order    4
Second Order   9
Third Order    16
Fourth Order   25
Fifth Order    36

Table 1 – Number of speakers required in 3D Ambisonics for the first five orders

3.2.2 Production of Sketches

A set of modules was added to the ensemble to explore various relationships between the control of the source positioning and the stream properties. From then on, various types of auditory scenes were sketched to serve as a basis for a sound installation. A description of these modules is given with the examples of application of the programme.

4 A Mixed Spatialization Technique

The spatialization module associates Ambisonics and binaural techniques. After a brief overview of their characteristics, I will explain the rationale for this association, its advantages and constraints, and its implementation.

4.0.3 Ambisonics Technique Overview

Ambisonics is a very efficient spatial audio technique for the simulation of auditory scenes, spatializing sounds in azimuth, elevation and distance. Ideally, it requires an even distribution of speakers over a sphere of constant radius. The higher the order, the more accurate the spatialization, but the higher the number of required speakers. The number of speakers (NS) according to the order (N) is such that NS = (N + 1)².

The multiple signals are encoded according to their polar coordinates, and decoded all together to be panned over the speakers.
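The relation NS = (N + 1)² can be checked in a few lines of Python. This is an illustrative sketch only; the system itself runs in Max/MSP:

```python
def speakers_required(order: int) -> int:
    """Minimum number of evenly spaced speakers for full-sphere
    (periphonic) 3D Ambisonics of the given order: (N + 1)^2."""
    return (order + 1) ** 2

# Reproduces Table 1 for orders 1 to 5.
for n in range(1, 6):
    print(f"Order {n}: {speakers_required(n)} speakers")
```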

4.0.4 Binaural Technique Overview

The binaural technique is based on head-related transfer function (HRTF) measurements - the filtering of the signal by the external ear. HRTF measurements are not continuous, but are generally taken every 10 or 5 degrees in azimuth and elevation, and at a fixed distance, around 2 m most of the time. Sound source localization is then simulated by filtering the signal with each HRTF. Note that most HRTF banks do not include filters for low elevations. The use of non-individualized HRTFs might lead to variable results depending on the listener, but a preparatory review suggests that overall satisfactory results can be achieved. Binaural listening allows for sound source spatialization independently of any facilities. Although it restricts the auditory experience to the virtual domain and individual listening, it bypasses the limitations of the sweet spot, which is also a drawback for collective listening.
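Because the measurements form a discrete grid, any requested direction must be mapped to the nearest measured HRTF. The sketch below assumes a perfectly regular hypothetical grid (5° azimuth, 10° elevation); real measurement sets deviate from such regularity at some elevations, as discussed later:

```python
def nearest_hrtf(az: float, el: float,
                 az_step: float = 5.0, el_step: float = 10.0):
    """Snap a requested direction (degrees) to the nearest point of a
    hypothetical HRTF measurement grid sampled every az_step degrees
    in azimuth and el_step degrees in elevation."""
    snapped_az = round(az / az_step) * az_step % 360  # wrap azimuth to [0, 360)
    snapped_el = round(el / el_step) * el_step
    return snapped_az, snapped_el
```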

4.0.5 Advantages of the Ambisonics to Binaural Conversion

The simulation of moving sound sources without speakers cannot be achieved with the binaural technique alone. This technique would be far too CPU-consuming and is too restrictive in terms of spatialization: as long as a sound spreads in space and moves, it has to be panned over the HRTFs, which is where Ambisonics comes into play. Besides, spatializing multiple sounds would require as many HRTF banks as there are audio signals, which means a large amount of processing. This constraint can also be bypassed with the use of Ambisonics, which yields a single stream of decoded signals that can be routed towards the adequate HRTFs.

Associating Ambisonics and binaural allows for a satisfactory simulation of auditory scenes in Ambisonics, for various orders and virtual speaker layouts, without having access to any facility. It makes possible the upstream comparison of the rendering of auditory scenes before broadcasting with a real set-up, and of course allows binaural listening for the audience. The conversion is an advantageous means to freely assess the quality of the auralization, comparing several Ambisonics orders, formats and speaker layouts, without material or sweet-spot constraints.

Therefore, this association is convenient in practical terms as much as in terms of exploring creation with spatial audio.


Figure 1 – Representation of the speaker layouts computed for the 5 Ambisonics orders.

4.0.6 Speaker Layout Calculation

The position of each speaker was calculated with the 3LD Matlab library [5], which allows the generation of even layouts from platonic solids or geospheres, whose vertices are evenly spaced. The azimuth and elevation computed for each vertex of a chosen model can then be converted into a speaker position. The following graphic representations account for the regularity of the layouts calculated for the five Ambisonics orders.

The table below shows the vertex coordinates computed for the five orders.

4.0.7 Ambisonics/Binaural Techniques Compatibility Issues

Associating Ambisonics and binaural techniques implies the coexistence of two different "space discretisation systems": the speaker layout, and the spatial sampling of the HRTF measurements. There exist a number of incompatibilities between the two systems with respect to the number of speakers, their even distribution, and the sampling of the spherical space implied by the HRTF measurements. So far, I have been using MIT’s KEMAR HRTFs, characterized by an almost even discretisation of 10˚ in elevation, from -40˚ upwards, and 5˚ in azimuth, with exceptions at 30˚, -30˚ and 50˚ elevation, where measurements were done at slightly different azimuths. This affected the virtual speaker layout design as well as the routing of the decoded signals to the HRTFs, and implied a greater flexibility of the conversion algorithm in order to allow for the use of various approximations of speaker layouts.

The computed values were not always strictly compatible with this set of HRTFs, especially in terms of elevation. A number of compromises had to be found to get the best approximations.

When the number of vertices was greater than the number of speakers required, the poles (-90˚ and +90˚) were ignored, with priority given to the upper pole, as there are no available HRTFs below -40˚. The elevation values of the speakers had to be compressed, expanded, shifted up or down, rotated (order 1 only) or fine-tuned (order 5 only) by hand. The modifications did not exceed 3 or 4 degrees. Some of the azimuth values of the 5th-order configuration also had to be slightly modified to match the available set of HRTFs. Hence there is a necessary trade-off between the greater precision of the spatialization with increasing Ambisonics order and number of dimensions, and the approximations implied by the Ambisonics-to-binaural conversion. Besides, all signals below -40˚ elevation are routed towards HRTFs at -40˚ elevation. The corresponding values were compiled and stored in the form of decoder presets, and can be selected dynamically via the programme interface.
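Two of these adaptation rules are mechanical enough to sketch in code: dropping the poles, and clamping elevations below the lowest measured HRTF. The hand-applied per-speaker shifts (≤ 3-4 degrees) are not reproduced here, and the helper is a hypothetical illustration rather than the actual preset-building code:

```python
def adapt_layout(vertices, lowest_el=-40.0):
    """Sketch of the layout-adaptation constraints described above:
    drop the poles (+/-90 deg elevation) and clamp any speaker whose
    elevation falls below the lowest measured HRTF elevation.
    `vertices` is a list of (azimuth, elevation) pairs in degrees."""
    adapted = []
    for az, el in vertices:
        if abs(el) == 90.0:          # poles are ignored
            continue
        adapted.append((az, max(el, lowest_el)))  # clamp low elevations
    return adapted
```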

5 System Architecture

A stream of point sources of variable density is generated from an audio file by a granular synthesis algorithm. This stream is spread in the three-dimensional space around a listener.

The morphology of the patterns results from the number of point sources, the values of the behavioural parameters of the flocking algorithm, and the positioning of the attraction points towards which the flock is drawn.


Figure 2 – Architecture of the system

The spectral characteristics of each point are determined by the mapping of the position of the sources in space to a temporal location in the sound material, and by the properties of the sound material itself. The spectral characteristics of the stream result from its density, and from the duration and periodicity of the grains as well as their individual spectral characteristics. Subsequently, these characteristics interact with the spatialization, and are also modified, to a lesser extent, by the reverberation.

The audio rendering is in 3D Ambisonics multichannel format, converted into binaural signals by an Ambisonics-to-binaural conversion algorithm. Various Ambisonics orders and speaker layouts can be simulated, and specific layouts can also be defined by the user.

The following sections describe the design of the system’s components and their articulation: the stream generation engine, the positioning module based on the flocking algorithm, and the audio rendering module associating Ambisonics and binaural techniques.

5.1 Sound Input

5.1.1 Granular Synthesis Engine Overview

The granular synthesis engine was built with gabor objects from IRCAM’s FTM library. The parameters of the engine are the grains’ periodicity, position, position variation, and duration. At the desired periodicity, a chunk of sound is sent through an audio channel. The multiple outlets allow for the overlapping of grains and the construction of a variety of textures.
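The interplay of periodicity and duration can be sketched as a simple onset schedule. This hypothetical helper is not the FTM/gabor API; it only illustrates how a duration longer than the periodicity produces overlapping grains and hence a continuous texture:

```python
def grain_schedule(periodicity_ms: float, duration_ms: float, total_ms: float):
    """Return the (onset, offset) times of the grains emitted over
    total_ms, plus the mean number of simultaneously sounding grains.
    When duration exceeds periodicity, grains overlap."""
    onsets = []
    t = 0.0
    while t < total_ms:
        onsets.append((t, t + duration_ms))
        t += periodicity_ms
    overlap = duration_ms / periodicity_ms  # mean simultaneous grains
    return onsets, overlap
```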

5.1.2 Integration in the System

One approach would be to have a single engine and spread the produced texture over the ensemble of sources, routing each grain of sound to one encoder input. In that case, the same parametric values are applied to all the grains before they are spatialized. I assumed that a homogeneous texture would be spread into space, with a completely different effect, which might have been an interesting approach. Nevertheless, it would have been necessary to design an additional module to dynamically govern the order and frequency with which each source is fed. Otherwise, a constant frequency and order could have led to undesirable effects, in the case of still patterns of sounds for instance. Such a module could probably have been successfully integrated in the system, but it implied a radically different approach to the relationship between spatialization and spectral properties, as well as more programming time.

Instead, I sought to control the properties of each source independently with respect to its positioning. This choice had several aims.

1. Controlling the spectral characteristics and periodicity of each source with respect to its position in space, to allow for the discretisation of sources.

2. Controlling the degree of homogeneity of the auditory scene with respect to the pattern’s morphology.

3. Investigating the relationships between volumetric patterns and sound material.


From the synthesis standpoint, this means that each point source is a construct of grains characterized by a specific texture. To control the characteristics of the sources individually, the synthesis engine was integrated into a [poly~], to get one synthesis engine per source. This strategy is quite costly in terms of CPU, but it seemed relevant with respect to the project. It is then the choice of the user to programme the positioning-to-synthesis mapping to control the global characteristics of the stream at a higher level. This task was achieved by the design of additional modules for the installation sketches.

5.1.3 CPU Limitations

The use of multiple instances is very CPU-consuming, even with a [poly~]. Due to the instability of the gabor objects, the number of instances of synthesis engines has been fixed to a maximum of 30 (the maximum number of boids) and cannot be changed dynamically. In order to save computation time, instances that are not used are muted, that is, ignored by the processor. This can be the case when the number of boids is inferior to 30, or when the periodicity and duration of the grains are such that no signal is output by the instance. Nevertheless, this process will not be effective with a high number of grains with short periodicities and long durations, as signals are then permanently produced by each instance of the engine.
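The muting policy amounts to a simple predicate over the fixed pool of instances. A hypothetical sketch (the actual muting is done with [poly~] messages in Max/MSP):

```python
def active_instances(n_boids: int, producing: list, max_instances: int = 30):
    """Sketch of the muting policy: of the fixed pool of synthesis
    instances, only those assigned to a boid AND currently producing
    signal stay active; the rest are muted to spare the CPU.
    `producing` is a list of max_instances booleans."""
    return [i for i in range(max_instances)
            if i < n_boids and producing[i]]
```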

5.1.4 Implementation for the Installation

The polar coordinates of each point are sent from the flocking algorithm and routed toward a set of modules that were designed to create the installation sketches. These modules interpret the position of the points in terms of grain position, position variability, duration and periodicity in various ways, in order to yield a variety of textures. More generally, the mapping of the point positions to the grain properties should remain at the discretion of the user, as this component of the programme is tightly interdependent with the variability of the sound material and with another component of the system, the control of the patterns of points in space. The setting of fixed relationships between these components requires a systematic approach to the control of synthesis by the positioning of sources and the overall characteristics of patterns. Such an approach requires practice and expertise in the domains of synthesis, but also psychoacoustics and cognition.

5.1.5 Sound Material

The properties of the sound material and its use should induce a sense of spatial cohesion suitable for the production of a globally coherent sound volume. This excludes out of hand any material bearing a causal, concrete, vocal or instrumental reference. Such a material can be generated with several synthesis techniques, among which physical modelling and granular synthesis could be privileged.

An audio file stored in a buffer is used as a reservoir for the synthesis of grains of various durations and periodicities. Note that the buffer that I used allows for dynamic changes of content, which I exploited for the sketches, and also allows crossfades and interpolations between various sound files.

The choice of the sound material that feeds the granular synthesis engine has been restricted to a fixed audio file, excluding the possibility of processing a live audio stream. This choice was justified by the necessity of keeping the sound properties constant in the generation of the stream, while exploiting the spectral evolution in time. An audio file represents a source whose characteristics are set, while a live audio material might become too unstable and evolve dynamically. The user can then use whatever initial material seems appropriate as a means to characterize the sound environment, by a fine mapping of the point positions in space and the flock behaviour onto the grain positions in the sound file, their duration and periodicity. These parameters determine the properties of the individual sources as well as the global stream properties. Knowing the sound material hence appeared as the most straightforward way to have control over the properties of the material yielded by the synthesis. Later in this report, an overview is given of the sound material that was chosen and its use for the creation of the installation sketches.

The realisation I am presenting uses an instrumental sound that appeared to be a relevant means to characterize and delineate the sound space: I chose a double bass harmonic glissando so as to dispose of a continuum of sound with progressive spectral changes, but also to exploit non-linearities, accidents, and noisy textures for some of the sketches. In order to characterize the space by means of the sound material, I applied a mapping of the azimuth to selected chunks of the sound file, or used very short successive chunks to create a sense of pitch and densify the grain textures. In some cases, I also used partial tracking and filtering to get smoother textures. The effects that were produced account for the importance of the spectrum properties in the production of textures, but they also reveal that disparities of intensity in the spectrum in the time domain could result in missing point sources, which can be a major issue, or which can, on the contrary, be exploited in a creative manner.
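The azimuth-to-chunk mapping described above can be sketched in one line of arithmetic. The linear mapping below is an assumption for illustration; the sketches actually used hand-chosen chunks of the glissando:

```python
def azimuth_to_grain_position(azimuth_deg: float, file_len_ms: float) -> float:
    """Hypothetical mapping of a source's azimuth (degrees) to the
    grain's read position (ms) in the sound file, so that the timbral
    continuum of the glissando is laid out around the listener."""
    return (azimuth_deg % 360.0) / 360.0 * file_len_ms
```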


5.2 Positioning System

A number of strategies could have been employed to design patterns of point sources in three dimensions. I opted for a flocking algorithm. Flocking algorithms simulate emergent behaviours arising out of a multiplicity of interactions between individual semi-independent agents, as in bird flocks or mammal herds, but also in biological or non-biological systems. These types of behaviours rule the displacement of points in space with respect to their neighbours and an attraction point.

Emergent behaviours are used as a means to create morphologies whose characteristics evolve organically and that are therefore easier for listeners to apprehend. Here, the flock represents a complex continuous sound entity endowed with a global spatial "behaviour", while individual members of the flock represent elementary sounds. It is therefore a convenient means to generate individual coordinates for each source from a global pattern.
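The principle can be illustrated with a minimal flocking step combining three of the classic rules (cohesion, separation, attraction). This is a deliberately reduced sketch of the idea, not the implementation used in the system, which exposes many more behavioural parameters:

```python
def flock_step(positions, attractor, cohesion=0.05, separation=0.5,
               attraction=0.1, min_dist=1.0):
    """One update of a minimal 3D flock: each point moves toward the
    flock centre (cohesion) and the attraction point (attraction), and
    away from neighbours closer than min_dist (separation)."""
    n = len(positions)
    center = [sum(p[k] for p in positions) / n for k in range(3)]
    new_positions = []
    for p in positions:
        v = [0.0, 0.0, 0.0]
        for k in range(3):
            v[k] += cohesion * (center[k] - p[k])       # toward flock centre
            v[k] += attraction * (attractor[k] - p[k])  # toward attraction point
        for q in positions:                             # avoid close neighbours
            d = sum((p[k] - q[k]) ** 2 for k in range(3)) ** 0.5
            if 0.0 < d < min_dist:
                for k in range(3):
                    v[k] += separation * (p[k] - q[k]) / d
        new_positions.append([p[k] + v[k] for k in range(3)])
    return new_positions
```

Iterating this step moves the whole pattern toward the attraction point while keeping the points apart, which is exactly the global-pattern/individual-coordinates duality exploited here.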

Initial explorations by manipulation of the parameters showed that the algorithm could yield a great variety of very distinct patterns, such as surrounding clouds of points, isolated particles, discrete swarms, or multiple groups of points, moving at various speeds or absolutely still. I chose to exploit those distinct patterns separately, to investigate the effects of various sound textures on the resulting sound stream. Note that the effect of these parameters varies dramatically depending on the number of points input to the algorithm. Besides, as the number of points also affects the synthesis, it is not always possible to control the sound textures independently of the spatial patterns.

5.2.1 Implementation in Max MSP

There exist a number of flocking algorithms with implementations in various programming environments, such as Matlab, Processing or Max MSP. I used a Max MSP implementation of Craig Reynolds’ flocking algorithm by Eric Singer, which includes a visual representation.

This version of the algorithm uses thirteen behavioural parameters:

1. number of neighbours each boid consults when flocking
2. centering instinct
3. attraction towards the attraction point
4. speed of acceleration
5. minimum speed
6. maximum speed
7. overall speed
8. neighbour avoidance
9. preferred distance between neighbours
10. neighbour speed matching
11. avoidance of the virtual limits defined by the scaling of space
12. distance of vision for avoiding the space limits
13. inertia

5.2.2 Use in the context of the project

Minor modifications were made to the implementation in order to control the visual rendering in three dimensions and to manipulate a limited number of attraction points directly via a graphic interface. This step allowed me to run a series of tests and determine the behavioural presets used to generate specific patterns for the installation sketches, out of flocks of different densities (2, 4, 8, 12, 24 and 30 boids). Those patterns were interesting metaphors for the definition of sound textures, but I have strived to bypass any reference to the visual representation, as it biases auditory representations.

5.2.3 Design of a System of Attraction Points

I designed a module to use different attraction points placed at various distances from the listener. This system is able to:

– modulate the propagation space
– delineate the space and distance of the streams
– switch from an immersive to a distanced perception with the same sound morphology
– switch from a source-oriented to a space-oriented listening mode


Figure 3 – Representation of the 11 attraction points in plane and frontal view, on a 100 m scale, by the [ambiencoder~]

Such changes can once again be achieved in a contrasted or continuous manner, with respect to the selection of the attraction points, the boids’ behaviour and the synthesis parameters.

The calculation of the motion in Cartesian space with respect to time t is based on the formulas:

x = cos(t) / √(1 + a²t²)
y = sin(t) / √(1 + a²t²)
z = -at / √(1 + a²t²)

The number of attraction points as well as their speed and individual motion can be modified via the subpatch interface; for further details on the implementation of this formula, see the patcher documentation.
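A direct transcription of these formulas shows that, whatever the values of a and t, the point stays on the sphere, since x² + y² + z² = 1 by construction. The helper below is an illustrative sketch, not the Max/MSP subpatch:

```python
import math

def attraction_point(t: float, a: float, radius: float = 1.0):
    """Spiral motion of an attraction point on a sphere of the given
    radius: a controls how quickly the spiral winds from the equator
    toward the lower pole as t grows."""
    d = math.sqrt(1.0 + a * a * t * t)
    x = math.cos(t) / d
    y = math.sin(t) / d
    z = -a * t / d
    return radius * x, radius * y, radius * z
```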

5.2.4 Application

The current implementation comprises 11 attraction points following a spiral motion on the surface of 11 spheres whose radii vary from 1 to 100 m. In the context of a possible installation, I had the radii evolve progressively between two boundaries, in order to delineate modulated spaces and irregular individual motions for each point. The speed in azimuth and elevation, as well as the direction of motion of the points, are determined individually, in order to get various patterns of points in space and time, with contrary and parallel motions. In a similar approach, the boundaries of the radii of the spheres have been set so that the eleven spheres are nested and overlap. As a result, the distance between the points and the listener varies, and one can obtain crossings between the points in distance, azimuth and elevation. The speed of the points tends to increase with distance, to avoid fast displacements in the near field. Indeed, should the boids follow a close attraction point, unpleasant physical effects might be caused by the fast and continuous panning of the sounds in the three dimensions.

The position of the attraction points is represented in the xy and xz planes on an [ambimonitor], which makes it possible to test the effect of different positionings of the streams. For the purpose of the sketches,

Figure 4 – Representation of the boids in plane and frontal view by the [ambiencoder~]

I have integrated a module for the automatic selection of the attraction point, but this selection can be done manually and the module can be deactivated.

5.2.5 Routing of the boids coordinates to the encoder

The cartesian coordinates output by the algorithm, boid by boid, are converted into polar coordinates by an [ambimonitor]. These coordinates are routed toward the same input as the sound source resulting from the mapping of the boid coordinates to the synthesis parameters.
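This conversion step can be sketched as follows; the exact conventions (degrees, azimuth origin and direction) used by the ICST objects are an assumption here, not taken from their documentation:

```python
import math

def cart_to_aed(x, y, z):
    """Convert cartesian boid coordinates to an (azimuth, elevation,
    distance) triplet for the encoder. Degrees are used and azimuth is
    measured from the x axis; the actual conventions of [ambimonitor]
    may differ."""
    d = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / d)) if d > 0 else 0.0
    return azimuth, elevation, d
```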

5.3 Signals Encoding and Decoding

5.3.1 Patcher overview

ICST's [ambiencode~] encoder can process multiple signals. It is associated with Peter Stitt's [sarcoder] decoder.

The choice of an order and layout triggers the loading of the appropriate encoding and decoding presets. The maximum number of signals is set to 30, but it can be modified by manually creating additional connections in the patch. A set of modules also dynamically changes the routing of the signal and polar coordinates from the granular synthesis engine and the boids to the binaural conversion module, via the encoder and decoder.

5.3.2 Decoder and Encoder Parameters

Decoding presets consist of a list of speakers, azimuth and elevation values, as well as format, weighting and listening options. Only the number and layout of speakers should be modified. Encoding presets correspond to the standard [ambiencode~] presets for the various Ambisonics orders. The user can experiment with the effects of the encoder's parametric values, but the order, format, weighting and listening options should always correspond to the decoder parameters. The presets are stored as .xml files, but new values can be input and stored manually within the patcher via the two speaker layout storage modules. All details on the use and behaviour of these objects are provided in the patcher documentation.

Figure 5 – Interface of the Encoding and decoding patcher

Figure 6 – Decoder values after the choice of a layout in 5th order

5.3.3 Distance rendering

The sphere radius argument, or scaling of cartesian coordinates, can dramatically change the sense of space, distance, and «geographic» limits of the rendering. With a value of 10, sources will be rendered as 10 times closer, while with a value of 0.1, they will be rendered 10 times more distant. This parametric value has been predefined, but it can also be changed by the user within the patcher. Note that the scaling of distance of sounds within the head is set to a standard value; the effect of this parameter can be explored by the user through the [ambiencode~] documentation.
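The behaviour of this parameter can be summarised by a toy model (an illustration only; the encoder's actual internal formula is not reproduced here):

```python
def perceived_distance(distance_m, scale):
    """Toy model of the scaling parameter described above: a scale of 10
    renders a source 10 times closer, a scale of 0.1 renders it 10 times
    more distant. This is an interpretation of the behaviour, not the
    [ambiencode~] implementation."""
    return distance_m / scale
```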

5.3.4 Reverberation Module

From the review of papers on distance perception [3, 4], it appears that reverberation is the main absolute and relative cue in distance assessment. It is an absolute cue in that it gives instant information about the distance of sources, regardless of any prior learning of the objects' or space's characteristics. It is also a relative cue in that it informs us of the relative position of sources in space. Therefore, a simple reverberation module has been integrated into the programme and applied downstream to the binaural signal. The parameters of the reverberation can be set ad libitum by the user.

5.4 Articulation of the System Components

The global system would not be functional without ensuring the correspondence between the positioning, spectral properties and 3D audio rendering of each source.

The cartesian coordinates of each boid are converted into polar coordinates by the ICST [ambimonitor]. These coordinates are sent to a corresponding instance of the synthesis engine and to the encoder. For example, the coordinates (aed n) are sent to [synthesisengine n] and to the encoder via [send grainn].

An audio channel number is assigned automatically at the output of the engine, depending on the rank of the instance. Continuing with our example, this would be [send~ grainn]. The signal and polar coordinates are routed to the corresponding audio input of the audio rendering module, [receive~ grainn]. Finally, all the signals are encoded and decoded to be restituted and converted as a single audio stream in binaural format.

This routing technique guarantees the correspondence between the positioning of a point source, the properties of the grain that represents it, and the rendering of this position into space.

5.5 Ambisonics to Binaural Conversion Module

The following paragraphs give a brief description of the conversion module integrated in the system, which is complemented by the patcher documentation. Further details about the properties of these techniques and the reasons for their use were given in section 4.

The general principle behind the conversion of signals from Ambisonics to binaural is the routing towards the left and right HRTFs that correspond to the azimuth and elevation of the encoded signal. In a nutshell, once a signal is decoded, it is input into an HRTF stored, just like an audio file, in a [buffir~]. This HRTF acts like a filter. With such a system, each decoded signal is assigned a left and right HRTF with the same azimuth and elevation values. This implies that as many HRTFs as necessary to cover all the speaker layout possibilities, for the five orders, should be available and selectable on demand. Besides,

Figure 7 – Conversion module in presentation mode

Figure 8 – A discrete homogeneous pattern of points (example 4) and a heterogeneous surrounding pattern of points (example 6), at a large 20 m scale.

as one HRTF per ear is needed, twice as many HRTFs as there are speakers are required. Thus, on the whole, I had to manage a bank of 160 left and right HRTFs.

For this reason, I designed a single conversion module that I integrated in the form of a [poly~]. The number of instances of this module is changed dynamically depending on the selected order. For example, 32 instances will be generated for the third order, and 72 instances for the 5th order. The HRTFs have been classified by growing azimuth and elevation, and given an index. To each speaker layout corresponds a list of indexes that represent the list of HRTFs. When an Ambisonics order is selected along with a speaker layout, a series of matrices operates in order to select the appropriate HRTF combination. Then each HRTF is assigned to the right [poly~], in which the decoded signal is input. Last, the converted signals are gathered into a left and right audio output.
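The principle can be sketched as follows (not the actual Max implementation: the HRIR bank, the index list and the function name here are hypothetical). Each decoded virtual-speaker feed is filtered by the left and right impulse responses selected for the layout, loosely analogous to the per-speaker [buffir~] filtering, and summed into a binaural pair:

```python
import numpy as np

def ambisonics_to_binaural(speaker_feeds, hrir_bank, hrir_indices):
    """Filter each decoded speaker feed with the left/right HRIRs chosen
    for its position and sum into a binaural pair. `hrir_bank` maps an
    index to an (ir_left, ir_right) tuple; `hrir_indices` gives, per
    speaker, the index selected for the current layout -- the role the
    matrices play in the patch."""
    n = len(speaker_feeds[0])
    left, right = np.zeros(n), np.zeros(n)
    for feed, idx in zip(speaker_feeds, hrir_indices):
        ir_l, ir_r = hrir_bank[idx]
        # FIR filtering, truncated to the input length
        left += np.convolve(feed, ir_l)[:n]
        right += np.convolve(feed, ir_r)[:n]
    return left, right
```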

6 Commented Examples

I have produced a series of examples showing how I started exploring the possibilities of the system for an installation. There exists a great number of possible applications of the system; I therefore intend to use these examples to expose how I exploited some of the metaphors inspired by the use of the flocking algorithm, and some of the relationships between texture and space I tried to develop. I will also mention some constraints and issues I encountered.

One will notice that many of these examples were inspired by the visual representation of the flocking algorithm. Although I am particularly aware of the problems raised by sterile attempts to reproduce visual sensations with auditory sensations, I should specify that I always strived to develop an idiomatic auditory approach in the manipulation of sounds and in the use of granular synthesis. Last, I would add that some of the visual metaphors that were suggested to me also carried an auditory content that I tried to emphasize.

In most cases, I used a system of random rhythm generation to control the periodicity of the grains, while mapping the durations of the grains according to distance, and their position in the audio file according to their azimuth or elevation, following different modalities.
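A minimal sketch of such a mapping, with hypothetical ranges and names (the actual values vary from example to example, as described below):

```python
def grain_params(azimuth_deg, distance_m,
                 dur_range_ms=(10.0, 150.0), file_length_ms=4000.0):
    """Hypothetical mapping in the spirit described above: grain
    duration follows distance (over the 1-100 m scale of the patch) and
    the read position in the audio file follows azimuth. All ranges are
    illustrative, not the patch's actual values."""
    d_min, d_max = 1.0, 100.0
    t = min(max((distance_m - d_min) / (d_max - d_min), 0.0), 1.0)
    duration = dur_range_ms[0] + t * (dur_range_ms[1] - dur_range_ms[0])
    position = ((azimuth_deg % 360.0) / 360.0) * file_length_ms
    return duration, position
```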

1. Example 1 - track 1
This texture was suggested to me by the representation of patterns of points that seemed to evolve as ribbons moving quickly into space, and that would sometimes look as if they were being torn. I picked a chunk of the glissando characterized by frictional sounds, corresponding to the beginning of the instrumental gesture, when the bow noise predominates over the pitched vibration of the string. I was possibly influenced by the very sound of the bow rubbing the surface of the table, which can sound like the soft sound of a ribbon. This part of the sound also showed interesting spectral changes that reminded me of those texture variations that occur when a sheet of paper or a piece of cloth is torn in two. I tried to create this effect by mapping distance to the duration of the grains, but the continuum I obtained was neither effective nor satisfactory. I therefore arbitrarily split the grains into two overlapping streams over two distinct ranges, from 10 to 50 ms and from 80 to 180 ms. In order to obtain quite a full texture, I used 30 grains and split them into two different streams. Within those streams, the duration of the grains would constantly grow from 1 to 70 ms and from 65 to 150 ms, without any relationship with distance. Consequently, I constructed a texture that would evolve in time but without relationship with the stream's spatial properties. Although

I obtained a texture that I found evocative and an interesting spatialization effect, I assume that a more convincing result could be achieved by a fine control of this relationship between polar coordinates and synthesis. One must note that an important issue was the difference of loudness that occurs with different grain durations. Such an effect would have affected the sense of distance that was supposed to arise from the large displacement of the stream. This led me to integrate the amplitude scaling module previously mentioned in section 5 of this report.

2. Example 2 - tracks 2-3
Here, a static cluster of 24 points is spread into space at more or less close distances. This statism is underlined by the slow periodicity of each grain. I sought a sense of cohesion in the grains' colour by mapping distance to a grain duration of 1 to 8 ms, which yielded a gradually growing sense of rugosity. This had the effect of reducing the presence of close grains and increasing the presence of more distant grains. Although the whole sound file is used to represent the spatial perimeter, this effect is very limited by the short duration of the grains. I chose such a compromise as my approach was pointillist. One will notice that we have trouble obtaining a continuous evolution between near and far field, and that internalization dominates over the sense of distance of points in an external space. This might be due to the HRTFs, or to the scaling of the encoder.

3. Example 3a and 3b - tracks 4-5
In the first example, a couple of points are attracted towards each other and collide, like particles or flies. The attraction speed is far inferior to the repulsion speed. Once again, I was probably influenced by the visual representation, but tried to determine the parameters of the synthesis so as to reinforce the perception of motions, crossings and speed. Here, I chose to map elevation to the grain position, in order to underline the vertical positioning of sources into space. Nevertheless, this sense of height appears to mix with a sense of distance, which might result from the very motion of the points. Periodicities and durations are defined to reinforce the sense of distance and proximity: textures evolve from rugosity to smoothness as points get closer to the listener, while periodicities oscillate in order to create overlapping or chopped textures. The rendering of distance by the encoder seems particularly effective, but could be fine-tuned to help increase the perception of a large space and reduce the contrast between near and far field. In the second example, 4 boids are used with the same behavioural parameters and quite similar synthesis parameters. Note that only the number of points was modified for this sketch, leading to a completely different behaviour, with parallel motions in all directions instead of attraction and repulsion. Here, azimuth is mapped to the position of the grains in the sound files. As in the previous example, the perception of individual motions seems facilitated by the reduced number of points and perhaps by the synthesis parameters (duration and periodicity). The contrasts in the distance rendering are less aggressive as the speed of the points is lower.

4. Example 4 - track 6
30 points are gathered in lines with a very sharp spatial definition, whose position into space varies abruptly, either by resetting the points' position or by changing the attraction point. I chose to underline this rectilinear aspect by giving long durations to all the grains and assigning them to very short chunks of sound, to produce a pitched, focalised mass of sound. I noticed that the texture before spatialization is particularly fluid and evocative, while the spatialization process seems to literally erode the texture, and therefore affect the perception of the stream's motions in the three dimensions. This is an issue, along with the scaling of distance, that I have to address.

5. Example 5a and 5b - tracks 7-8
In both cases, distance has been mapped non-linearly over very short grain durations, in a pointillistic approach similar to example 2. In the first case, 8 points are characterized by a great inertia and move independently from each other. In the second case, 12 points are gathered in a dense swarm moving rapidly into space. The discretisation of the points into space seems easier with fewer points, and we still face an externalisation and distance rendering problem.

6. Example 6 - tracks 9-10
A dense stream of 30 points contracts and expands rapidly into space, moving quite fast from one attraction point to another. Here, the textures are heterogeneous, as the durations of the grains are non-linearly mapped to distance, to create successive levels in the rendering of space. Although I have added a module that limits the amplitude of the signal with respect to the duration of the grain, this still seems to affect the perception of distance to a certain extent. Besides, a reverberation effect tends to occur with long grains, which again interferes with the distance rendering. However, the effect of proximity seems to be reinforced by the production of discrete sources. Note that the sound rendering is affected by glitches due to CPU consumption.

Note: From a general standpoint, it seems that externalization is quite difficult to achieve, which might be explained by the HRTFs.

7 Further Works

1. Speaker Layout Computation
Hollerweger [5] suggests that uneven speaker layouts could be calculated according to the spatial resolution of the ear, increasing the number of speakers in the areas of smaller just noticeable difference in azimuth or elevation. This amounts to optimizing the number of available speakers for a given order. I lack information regarding such algorithms or experiments on this specific matter. Nevertheless, I do not exclude considering a psychoacoustics-based approach to speaker layout design in the future.

2. HRTFs Quality
The MIT KEMAR HRTF bank goes back to 1994 and is very restrictive in terms of sampling of the space, as it contains no measurements below -40° elevation, and it seems to make the externalization of sounds difficult. The CIPIC HRTF bank, measured in 2001, or IRCAM's Listen bank, could be alternatives. The Listen bank, more specifically, is based on measurements for an important corpus of subjects, which would allow for interesting comparisons of the accuracy of the binaural rendering. Nevertheless, it would require the computation of new speaker layouts, as the spatial sampling is different from that of MIT's.

3. Distance Rendering
Distance rendering is one of the most difficult tasks to achieve with spatial audio. The scaling of distance inside and outside the head by the encoder has to be refined, as I found that the thresholds between a sense of proximity and almost imperceptible sounds seemed very close to each other. Smooth changes and a sense of continuity still need to be improved. The use of the attraction points and the behavioural parameters can be fine-tuned in relation to the sound rendering. There exist a number of options for the rendering of distance in the near field, but they do not seem to be compatible with the Ambisonics technique. Nevertheless, I would like to investigate the possibilities for a better rendering in the near field with HRTFs, and I shall mention them as they open interesting prospects. A realtime distance HRTF extrapolation engine has been developed at LIMSI, but it seems that it also includes an Ambisonics to binaural conversion; it is not likely that I can integrate it into the system. In 2010, IRCAM developed an extrapolation method for the calculation of HRTFs in the proximity region. In order to assess the precision of the simulation, HRTFs were measured on a dummy head at eight different distances (20, 30, 50, 70, 100, 136, 170 and 200 cm), resulting in 6920 measurement positions. Another issue would be the considerable implementation time necessary to adapt our current conversion patch, which would imply the development of a distance rendering module in the control of the Ambisonics to binaural conversion.

4. Optimization of the Flocking Algorithm and Synthesis Quality
A means to obtain smoother transitions has to be found when the number of sources changes, as the boids' position is automatically reset. Other flocking algorithms might not have this disadvantage. The implementation of a less CPU-consuming algorithm, in the form of simple code, would allow smoother motions of point sources in space, as glitches affect the transmission of data by the algorithm as soon as the 3rd order is used. In the perspective of an installation, improvements should be made to the granular synthesis engine in order to ensure smooth transitions between textures, either by interpolating the synthesis parameters in conjunction with the interpolation of the boids' position, or by integrating a crossfade module.

5. Space Characterization
We have used a mapping of the position of grains in the sound file, or in chunks of the sound files, depending on the azimuth and sometimes the elevation of the sources. I hope to use spectral interpolation of audio files in order to bring changes of spectral colour depending on elevation or azimuth.

6. Stability of the Auditory Scene
The mapping of the sound coordinates to the user's head coordinates with a head tracker still has to be implemented in order to restitute a stable auditory scene.

7. Creative Approach
I hope to develop a more systematic and effective approach to the relationships between sound properties and spatialization. This requires a refinement of the system and a growing degree of expertise.

8 Conclusion

This document has described the conception of a toolkit dedicated to the spatialization of multiple sound sources in a three-dimensional space, using Ambisonics and binaural techniques. This project was the result of my interest in the design of virtual sound sculptures, and was intended to make up for the lack of available means to explore this domain outside the facilities provided by institutions. I have given an overview of the mixed spatialization techniques and explained why these seemed advantageous. I have then detailed the system architecture, its components and their articulation, indicating possible constraints or limits at various levels. Using a series of examples produced when starting to experiment with the system, I have exposed possible uses of the system, more particularly within the context of the installation I hope to develop, pointing out some specific issues related to the rendering of distance. Last, I have listed a number of improvements that could be brought to the system in general and for the development of an installation.

9 References

[1] Potard, G., Burnett, I., "A study on sound source apparent shape and wideness", Proceedings of the 2003 International Conference on Auditory Display, 6-9 July 2003.

[2] Potard, G., Burnett, I., "Control and measurement of apparent sound source width and its applications to sonification and virtual auditory displays", Proceedings of ICAD 04 - Tenth Meeting of the International Conference on Auditory Display, 6-9 July 2004.

[3] Zahorik, P., "Auditory display of sound source distance", Proceedings of the 2002 International Conference on Auditory Display, Kyoto, Japan, July 2-5, 2002.

[4] Zahorik, P., "Assessing auditory distance perception using virtual acoustics", J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002.

[5] Hollerweger, F., 3LD - Library for Loudspeaker Layout Design: A Matlab library for rendering and evaluating periphonic loudspeaker layouts. IEM-Report 32/06, Institute of Electronic Music and Acoustics, Graz University of Music and Dramatic Arts, 2006.
