
Romain Boonen, AEDS 911

SAE Institute Brussels

September 2013

Three-Dimensional Hearing:

The Body, The Brain And The


0 FOREWORD
1 THE BODY: FUNDAMENTALS OF THE PHYSICS OF SOUND
  1.1. THE PHYSICS OF SOUND: A BRIEF INTRODUCTION
    1.1.1. THE NATURE OF SOUND
    1.1.2. THE IMPEDANCE OF A MEDIUM
    1.1.3. THE FOURIER ANALYSIS
    1.1.4. LINEARITY OF A SYSTEM
2 THE BODY: PHYSIOLOGY OF THE EAR
  2.1. THE OUTER EAR
  2.2. THE MIDDLE EAR
  2.3. THE INNER EAR AND THE COCHLEA
    2.3.1. ANATOMY OF THE COCHLEA
      ESSENTIAL MECHANICS OF THE COCHLEA
      THE ORGAN OF CORTI
    2.3.2. PHYSIOLOGICAL FUNCTIONING OF THE COCHLEA
    2.3.3. SCALAE AS FLUID COMPARTMENTS: PERILYMPH AND ENDOLYMPH
SOUND SOURCE LOCALIZATION
  3.1. ASCENDING PATHWAYS OF THE AUDITORY NERVE
    3.1.1. THE AUDITORY NERVE
    3.1.2. COCHLEAR NUCLEI
      THE VENTRAL COCHLEAR NUCLEUS
      THE DORSAL COCHLEAR NUCLEUS
    3.1.3. THE SUPERIOR OLIVARY COMPLEX
    3.1.4. THE LATERAL LEMNISCUS
    3.1.5. THE INFERIOR COLLICULUS
    3.1.6. THE THALAMUS’ MEDIAL GENICULATE BODY
    3.1.7. THE AUDITORY CORTEX
  3.2. SOUND SOURCE LOCALIZATION
    3.2.1. THE HORIZONTAL PLANE
    3.2.2. THE VERTICAL PLANE
    3.2.3. DISTANCE FROM THE SOURCE
4 THE MACHINE: REQUIRED BACKGROUND
  4.1. INTRODUCTION TO THE MACHINE
  4.2. HEAD-RELATED TRANSFER FUNCTIONS
  4.3. CHANNELS VS OBJECTS
  4.4. CONVOLUTION
  4.5. INTRODUCTION TO DIGITAL FILTERS
5 THE MACHINE: IMPLEMENTATION STRATEGY
  5.1. DESIGN GOALS
  5.2. OVERVIEW OF THE CHANNEL-BASED MODEL
  5.3. OVERVIEW OF THE OBJECT-BASED MODEL

When SAE Brussels presented this thesis project and its requirements to their students, the stated goal was for them to specialize in a specific field: to research a topic that was studied in class and take that knowledge to the next level, in order to establish a specialty that would be a helpful skill for starting a career. This project was extremely appealing to me, because I saw in it a great opportunity to gain some level of expertise in a field I had been attracted to: binaural 3D sound.

As I started gathering basic information about this field of study, it did not take me long to realize that its complexity lay in the fact that it requires at least some degree of expertise in several different technical fields, the most important being digital signal processing, psychoacoustics, programming, mathematics and physics. At that time, all I had was my (basic) expertise in sound engineering but, thanks to my motivation, I felt empowered and determined to stick with binaural 3D sound, enough to patiently gather all of the required knowledge and eventually tackle that complexity in order to make it my own field of expertise.

This paper is the result of nine months of dedicated work and is composed of three main parts, which I decided to entitle The Body, The Brain and The Machine. Most of the time spent on this paper was dedicated to learning the skills necessary to create the models presented in the last part, and to interconnecting them. In this context I attended two days of conferences about binaural 3D sound for broadcasters in May 2013 at the EBU headquarters in Geneva, Switzerland. There, Poppy Crum (Dolby) gave a presentation about the neuroscience behind spatial binaural sound, entitled « Neural Sensitivities – HRTF Representations In The Auditory Pathway ». From then on, I grew so intrigued and fascinated by the way humans hear (and more generally, the way the brain works) that this presentation marked a turning point in my research. In fact, I decided to include two large sections dedicated to the physiology of human hearing and to psychoacoustics (respectively entitled “The Body” and “The Brain”), which are meant to provide my readers with the background knowledge necessary to gain insight into the binaural implementation models presented further on.

Indeed, the ultimate purpose of the last part of this work (The Machine) is to present and compare binaural implementation models based respectively on the two current digital sound technologies, namely the channel-based model and the object-based model. The purpose is to assess theoretically their strengths and drawbacks for the binaural reproduction of 3D sound. It is worth mentioning that “The Machine” is the subject of an AES Convention paper that I shall present on October 17, 2013 in New York City.

First of all, I would like to thank our team of helpful supervisors as well as our inspirational teachers (and really the entire SAE Institute Brussels) for giving their students the wonderful opportunity to do their thing. As far as I am concerned, I know that every day spent at this school has contributed to making me a more inspired individual and in the end a better person. What else could I have asked for? I would like to give a special thank you to Robin Reumers for the time he invested in supporting the last sections of this work. I would like to acknowledge his deep understanding of all things audio (and more) as well as his creative technical problem-solving skills, which turned out to be very useful more than once. I would also like to thank my family, my lover Gracie and my six brothers for the support they have provided me with throughout these several long months, which actually passed quicker than I ever thought. I feel blessed!

The purpose of this section is to give a small introduction to some fundamentals

of sound physics that will turn out to be quite useful in our understanding of the human

hearing physiology. Let us dive in…


What is a sound wave? A sound wave is a vibrational movement of air molecules around their initial positions. It is important to realize that the propagation of sound waves is very different from wind, where molecules flow over large distances!

A sound wave is defined by two main attributes: its frequency and its amplitude (or intensity). The frequency of a sound wave is the number of cycles that pass a specific point in space every second. Frequency, which is commonly specified in hertz (Hz) or cycles per second (c/s), has pitch as its subjective correlate when perceived by most living organisms. The amplitude of a sound wave refers to the magnitude of the vibrating movement of an air molecule, and has loudness as its subjective correlate.¹ Other parameters depend directly upon those two main attributes: a sound wave's frequency defines its period (i.e. the length of time necessary for the wave to execute one full cycle of compression and decompression of the air molecules) and its wavelength (the distance, in meters, covered during one period), whereas a sound wave's amplitude dictates the resulting air pressure variation, air velocity and air displacement. The air pressure relates to the level of compression of the air molecules. When speaking of pressure, it is important to clarify that what varies is the atmospheric pressure of the free field through which the sound wave is travelling. However, a sound wave travelling in a free field has proportionally very little impact on the average atmospheric pressure of this free field; indeed, a level as high as 140 dB SPL makes the general atmospheric pressure vary by only about 0.6%. The air velocity relates to the rate of change of position of the air molecules, whereas air displacement relates to the distance those molecules move around their equilibrium position.

¹ Both the pitch and loudness perceptions will be discussed further on in this work, as such phenomena belong to the realm of the brain and not the physical world per se.

Sound waves can be described using other parameters as well, and it is possible to relate them to each other easily using simple equations. For example, in the case of a sinusoid, its peak pressure (p) above the mean atmospheric pressure and its velocity (v) relate to each other in the following way:

$$p = z\,v$$

where z represents the impedance of the medium.
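The two figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the usual 20 µPa reference for dB SPL and a standard atmosphere of 101,325 Pa, it borrows the 413 N·s/m³ impedance of air quoted in the impedance section, and it reads the "0.6%" as the peak-to-peak swing of a sinusoid (that reading is an assumption; the RMS value alone would give roughly 0.2%).

```python
import math

P_REF = 20e-6        # reference pressure for dB SPL, in pascals
P_ATM = 101_325.0    # standard atmospheric pressure in pascals (assumed value)
Z_AIR = 413.0        # impedance of air at 20 °C in N·s/m³ (figure quoted in the text)


def spl_to_rms_pressure(level_db):
    """RMS sound pressure in pascals for a level given in dB SPL."""
    return P_REF * 10 ** (level_db / 20)


def peak_to_peak_fraction(level_db):
    """Peak-to-peak pressure swing of a sinusoid, relative to atmospheric pressure."""
    peak = math.sqrt(2) * spl_to_rms_pressure(level_db)
    return 2 * peak / P_ATM


def particle_velocity(pressure, z=Z_AIR):
    """Velocity obtained from p = z * v, for a pressure in pascals."""
    return pressure / z


rms = spl_to_rms_pressure(140)
print(f"RMS pressure at 140 dB SPL: {rms:.0f} Pa")
print(f"peak-to-peak swing vs. atmosphere: {100 * peak_to_peak_fraction(140):.2f} %")
print(f"peak velocity in air: {particle_velocity(math.sqrt(2) * rms):.2f} m/s")
```

Even at such an extreme level the pressure swing stays well below one percent of the static atmospheric pressure, which is the point made above.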

The impedance of a medium is an appropriate concept to address at this point because it will prove to be of great importance in the physiology of hearing, in the form of « impedance jumps ». The impedance of a medium can be thought of as its resistance. For example, water has a much higher impedance than air, because the pressure required to produce a sound at a given intensity in water is much higher than the pressure required to produce a sound of the same intensity in air. This is simply due to the fact that water is much denser than air. It thus requires proportionally more energy to give those water molecules some velocity. In the SI system, the impedance (z) is measured in (N/m²)/(m/s), or N·s/m³. In order to gain some understanding of our water/air example, which will matter later on in this work, let us associate it with some figures… The impedance of air at room temperature (20°C) is about 413 N·s/m³ whereas the impedance of water is about 1.5 × 10⁶ N·s/m³, which means that the water's « resistance » is about 3632 times higher than the air's. When a sound wave propagates in the air at 20°C and meets a water surface, the change of impedance is such that only a tiny fraction of the incident wave's intensity is transmitted into the water. The result of this impedance jump is that most of the sound wave is reflected at the water surface. The following formula gives the proportion of an incident wave propagating in a medium of impedance z1 that will be transmitted into a second medium of impedance z2:

$$T = \frac{4\,z_1 z_2}{(z_1 + z_2)^2}$$

When plugging our air and water impedance values into the equation, we find that only about 0.11% of our incident sound wave's intensity propagates into the water. Converted into decibels, this equals an attenuation of about 30 dB. In a further section, we will address how the middle ear manages to transmit a signal to the brain despite such attenuation.
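As a quick check of the air/water figures, the transmitted proportion 4·z1·z2 / (z1 + z2)² can be evaluated directly. This is a minimal sketch using the two impedance values quoted above; the function names are ours and purely illustrative.

```python
import math

Z_AIR = 413.0      # N·s/m³, air at 20 °C (value quoted in the text)
Z_WATER = 1.5e6    # N·s/m³, water (value quoted in the text)


def transmitted_fraction(z1, z2):
    """Proportion of incident intensity passing from impedance z1 into z2."""
    return 4 * z1 * z2 / (z1 + z2) ** 2


def to_decibels(intensity_ratio):
    """Express an intensity ratio in decibels."""
    return 10 * math.log10(intensity_ratio)


t = transmitted_fraction(Z_AIR, Z_WATER)
print(f"impedance ratio water/air: {Z_WATER / Z_AIR:.0f}")
print(f"transmitted intensity:     {100 * t:.2f} %")
print(f"attenuation:               {-to_decibels(t):.1f} dB")
```

Note that the expression is symmetric in z1 and z2: the same proportion is transmitted whether the wave goes from air to water or from water to air, and it reaches 100% only when both impedances are equal.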


A further concept to be introduced is the Fourier transform. Not only will it help us understand the physiology of hearing at the level of the cochlea, but its implications are so broad that it will be mentioned again in the fifth part of this work (“The Machine: Implementation Strategies”), which concerns signal processing. Joseph Fourier showed that it is possible to decompose a complex signal into a sum of sine waves. That process is called Fourier analysis, and it bridges the gap between a signal's time domain and frequency domain. When the result is plotted with RMS value on the y axis and frequency on the x axis (as is generally the case), it is easy to see which frequencies compose a complex signal, as well as their respective intensities (see Figure 1).

Figure 1. Decomposition of a square wave ys into a series of sine waves y1, y2, y3 etc. using Fourier analysis. Retrieved from: dle/ on Sept. 18, 2013.

The analysis of an infinite sine wave represented in the time domain results in a single line in the frequency domain, indicating that wave's frequency and intensity. Very similarly, the analysis of an infinite square wave shows the fundamental with its odd harmonics. However, this model is obviously purely theoretical, as no signal is infinite. What happens with finite signals? The lines indicating the different frequencies broaden and turn into bands, whose widths are inversely proportional to the length of the input signal. The longer the signal, the better the resolution we are able to get in the frequency domain.
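Both statements can be illustrated numerically. The sketch below uses a direct (slow) discrete Fourier transform written with the standard library only; the window length, the number of cycles and the 48 kHz sample rate are arbitrary assumptions chosen for the illustration. A square wave that fits an exact number of cycles in the analysis window shows energy only at its odd harmonics, with amplitudes close to 4/(πh), and the spacing between analysis frequencies shrinks as the analysed signal gets longer.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Amplitude of each frequency bin, computed with a direct O(N^2) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(2 * abs(acc) / n)
    return mags


def square_wave(cycles, samples_per_cycle):
    """A +1/-1 square wave containing a whole number of cycles."""
    half = samples_per_cycle // 2
    return [1.0 if (i % samples_per_cycle) < half else -1.0
            for i in range(cycles * samples_per_cycle)]


def bin_spacing_hz(sample_rate_hz, n_samples):
    """Distance between two neighbouring analysis frequencies."""
    return sample_rate_hz / n_samples


CYCLES = 4
spectrum = dft_magnitudes(square_wave(CYCLES, 64))
for harmonic in range(1, 8):
    # Harmonic h of the square wave falls exactly on bin h * CYCLES.
    print(harmonic, round(spectrum[harmonic * CYCLES], 3))

# Ten times more samples at the same (assumed) sample rate: ten times finer bins.
print(bin_spacing_hz(48_000, 4_800), bin_spacing_hz(48_000, 48_000))
```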

One may be tempted to ask: « why sine waves? ». There are several reasons: sine waves are quite easy to handle mathematically. They also happen to represent the oscillation of quite a few physical systems, and are therefore very present in natural phenomena. But probably the most interesting reason why Fourier analysis proves to be of great importance in hearing physiology is that our ear actually performs this process constantly, although to a limited extent. As mentioned above, this feature will be discussed in the cochlea section.

The reverse of analysis (taking many sinusoids and adding them together to form a complex signal) is called synthesis. It will not be as useful as Fourier analysis in the scope of this work and will therefore not be addressed.


The last concept to be introduced in this section is linearity. This notion will be useful to describe and properly understand the different stages of the auditory system. A system is referred to as « linear » when it satisfies two properties: superposition and homogeneity. A system is non-linear when either of those conditions is not fulfilled. Mathematically, those properties are defined as follows.

The superposition property states that for two different inputs x and y, both belonging to the domain of the function f:

$$f(x + y) = f(x) + f(y)$$

Put in plain words, this equation tells us that the result of two or more inputs plugged in at the same time is the same as the sum of the results from the inputs plugged into the system separately.

The homogeneity property states that for any input x in the domain of function f and for any real number k:

$$f(kx) = k\,f(x)$$

This equation tells us that if the input is multiplied by a factor k, the output will be multiplied by the same factor k as well.

The fact that a system is linear implies one more important property: the frequencies contained in the output of the system were present in the input signal in the first place! Indeed, a linear system does not generate new frequency components.
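The two properties, and the remark about new frequency components, can be tried out on toy systems. In the sketch below, `linear_gain` and `soft_distortion` are hypothetical examples invented for the illustration (a plain gain and a gain with an added squared term); they are not models of any stage of the ear. A pure tone placed on analysis bin 3 is passed through each system and the output is inspected at bin 3 and at bin 6, i.e. at twice the input frequency.

```python
import math


def linear_gain(x):
    """A linear system: a plain gain stage."""
    return 0.5 * x


def soft_distortion(x):
    """A non-linear system: the squared term breaks both properties."""
    return x + 0.3 * x * x


def satisfies_superposition(f, x, y, tol=1e-9):
    """Check f(x + y) == f(x) + f(y) for one pair of inputs."""
    return abs(f(x + y) - (f(x) + f(y))) <= tol


def satisfies_homogeneity(f, x, k, tol=1e-9):
    """Check f(k * x) == k * f(x) for one input and one factor."""
    return abs(f(k * x) - k * f(x)) <= tol


def amplitude_at(samples, k):
    """Amplitude of the component located on DFT bin k."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


N = 64
tone = [math.sin(2 * math.pi * 3 * i / N) for i in range(N)]  # pure tone on bin 3

for system in (linear_gain, soft_distortion):
    out = [system(s) for s in tone]
    print(system.__name__,
          satisfies_superposition(system, 1.0, 2.0),
          satisfies_homogeneity(system, 1.0, 3.0),
          round(amplitude_at(out, 3), 3),   # level at the input frequency
          round(amplitude_at(out, 6), 3))   # level at twice the input frequency
```

A check on a handful of values can only disprove linearity, never prove it, but it is enough to show the idea: the gain stage leaves bin 6 empty, whereas the squared term creates a component there that was absent from the input.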

The auditory system is the sensory system that allows humans to perform the mechanoelectrical transduction of sound waves into neural action potentials. This highly complex system is situated outside (for the pinna) and inside the temporal bone (shown in red in Figure 2).

The human hearing system comprises three main parts: the outer, middle and inner ears. Their anatomies and roles will be investigated in this section. Nonetheless, it is worth noting that the great complexity of the physiological side of human hearing will only allow us to scratch the surface of this most fascinating topic in the scope of this work.


The outer ear consists of a partially cartilaginous structure called the pinna, which comprises a resonant cavity called the concha. The concha forms the entry of the ear canal (also called the meatus), which leads to the tympanic membrane (also referred to as the eardrum). The outer ear fulfills two main roles: it helps localize sound sources and it increases the intensity of the incoming sound waves.

The pinna holds a paramount role in this paper because it is one of the main actors in our ability to localize sounds. Indeed, the pinna's shape (which is very individual and can be quite different from one person to another) spectrally modifies the incoming sound waves, giving the brain the cues it needs to assess the positions of sound sources on the vertical plane. The second important role of the pinna is to funnel the waves reaching it into the ear canal. This process increases the intensity of the sound waves reaching the eardrum by about 15 to 20 dB in the 2.5 kHz range (Wiener and Ross, 1946) through resonances produced either by the combination of the concha and the meatus (2.5 kHz resonance) or by the concha alone (5.5 kHz resonance).

It is possible to measure the influence of the pinna on the waves coming from a

sound source at a known azimuth, elevation and distance. This information, which is

extremely valuable in the scope of this paper, is called the Head-Related Transfer

Function (HRTF) and will be discussed in further sections.


Figure 2. The temporal bone is represented in red. Retrieved from: http://commons.wikimedia.org/wiki/File:Temporal_bone.png in July, 2013.

The middle ear consists of the ossicles (the malleus, the incus and the stapes, which is also known as the stirrup) and acts as an intermediate step between the eardrum and the cochlea, in the manner of an impedance transformer. Indeed, the purpose here is to turn acoustical energy into mechanical energy. The malleus is attached to the eardrum and, quite rigidly, to the incus. It vibrates at the same rate as the tympanic membrane, and together those two bones transmit the force to the stapes (about the size of a grain of rice), which is connected to the cochlea's oval window, to be discussed in the next section. Interestingly enough, those three small bones stop growing very early in a newborn's life, so that a young child's ossicles are already the same size as an adult's.

Figure 3. Cross section of the temporal bone, revealing the main parts involved in the outer, middle and internal ears. Retrieved from: http://www.directhearingaids.co.uk/index.php/33/how-hearing-balance-work-together/ in August, 2013.

As mentioned in the previous paragraph, the role of the middle ear is to transform the impedance from the large, low-impedance eardrum to the small, high-impedance oval window. Without the middle ear, the reflections due to the impedance jump would be so strong that only a fraction of the incident wave would manage to enter the oval window, and the resulting perceived level would be much lower. The ossicles thus substantially reduce this energy attenuation. At this point, it is worth mentioning that the actual functioning of the impedance-transforming process is quite complex and, since a thorough explanation of it would not substantially help the proper understanding of the following sections, it shall only be covered superficially. However, it can be noted that this impedance-transforming process relies on two principles. The first one is that, since the stapes' footplate in the oval window is much smaller than the eardrum where the vibrations are coming from, the energy is concentrated on a smaller area, which effectively increases the pressure at the oval window. The actual increase is given by the ratio of the two areas. The second principle, though less prominent, is the lever action of the incus. Being smaller than the malleus, the incus increases the force and decreases the velocity transmitted to the stapes.
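To give an order of magnitude for these two principles, the pressure gain can be sketched as the area ratio multiplied by the lever ratio. The numbers below are placeholders only, of the order commonly quoted in textbooks; they are not measurements from this work, and the function names are ours.

```python
import math

# Placeholder values (assumptions for illustration, not taken from this work).
EARDRUM_AREA_MM2 = 55.0     # effective vibrating area of the eardrum
FOOTPLATE_AREA_MM2 = 3.2    # area of the stapes' footplate
LEVER_RATIO = 1.3           # force gain attributed to the ossicular lever


def pressure_gain(eardrum_area, footplate_area, lever_ratio=1.0):
    """Pressure ratio oval window / eardrum: area ratio times lever ratio."""
    return (eardrum_area / footplate_area) * lever_ratio


def pressure_ratio_to_db(ratio):
    """Express a pressure ratio in decibels."""
    return 20 * math.log10(ratio)


area_only = pressure_gain(EARDRUM_AREA_MM2, FOOTPLATE_AREA_MM2)
with_lever = pressure_gain(EARDRUM_AREA_MM2, FOOTPLATE_AREA_MM2, LEVER_RATIO)
print(f"area ratio alone: x{area_only:.1f} ({pressure_ratio_to_db(area_only):.1f} dB)")
print(f"with lever:       x{with_lever:.1f} ({pressure_ratio_to_db(with_lever):.1f} dB)")
```

With these placeholder values the area ratio clearly dominates, which is consistent with the lever action being described above as the less prominent of the two principles.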

What about the linearity of transmission of the ossicles? Guinan and Peake (1967) found that the stapes' movement increased in proportion to the input up to 130 dB SPL for frequencies below 2 kHz and up to about 140 to 150 dB SPL for frequencies above. Those results thus seem to point towards linear transmission in the ossicles up to those intensities and, although the measurement system used in that specific research could only have detected odd harmonics at a level of 10-20%, there are likely to be no significant harmonics or intermodulation products at lower intensities. However, it is worth mentioning that the suggested linearity of the middle ear may be affected by static pressures applied to the ear. Indeed, such pressures would make the joint connecting the malleus and the incus more rigid and stretch the ligament connecting the stapes to the oval window. Another element also influences the linearity of the middle ear beyond 75 dB SPL: the middle ear muscles.

Two main striated muscles attached to the ossicles protect the inner ear from damage. The tensor tympani is attached to the malleus (on the eardrum's side) whereas the stapedius muscle is attached to the stapes. When the sound pressure level at frequencies below 1-2 kHz becomes too high, the middle ear muscles contract and increase the stiffness of the ossicular chain. However, their action is quite complex and has been shown to have repercussions at high frequencies as well. One could therefore say that humans are equipped with multiband compressors right in their ears… Wever and Vernon (1955) actually showed that this muscle contraction reflex keeps the intensity of the stimulus reaching the cochlea fairly constant for low frequencies beyond the reflex threshold (around 75 dB SPL), effectively acting as a multi-band brickwall limiter!
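For readers less familiar with dynamics processors, the compressor/limiter analogy can be written down as a static input/output curve. The sketch below is a deliberately idealised audio-engineering model, not a physiological one: it ignores frequency dependence and the time the reflex needs to act, and only the 75 dB SPL threshold is taken from the text.

```python
import math


def static_curve(input_db, threshold_db=75.0, ratio=math.inf):
    """Output level of an idealised compressor; an infinite ratio is a brickwall limiter."""
    if input_db <= threshold_db:
        return input_db                       # below threshold: level unchanged
    return threshold_db + (input_db - threshold_db) / ratio


for level in (60, 75, 90, 105):
    print(level, static_curve(level), static_curve(level, ratio=4))
```

With an infinite ratio, every input above the threshold comes out at the threshold level, which is the "constant intensity" behaviour described above; a finite ratio would instead correspond to the softer action of a compressor.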

Figure 4. Detail of the middle ear. Retrieved from: http://cueflash.com/decks/PHYSIOLOGY_OF_AUDITION_-_54 in August,


After the middle ear comes the inner ear, which is composed of the cochlea and the bony labyrinth, which itself contains the vestibular system. The vestibular system is responsible for the sense of spatial orientation and balance. We shall focus on the cochlea, which is the central piece of our auditory system and by far the most complex one. Its intrinsic role is to convert the physical vibrations received from the action of the ossicles into electrical information that the brain can recognize as sounds, and a basic understanding of it will require some chemical and electrical explanations.


Anatomically, the cochlea is a coiled tube separated lengthways into three sections known as the scala vestibuli, the scala media and the scala tympani. Those three scalae spiral together from the base of the cochlea (the larger side) to the apex (the narrower, pointy side), keeping their proportions throughout their turns. The cochlea is about 1 cm wide and 5 mm high. Because the scala media is smaller than the two outer scalae, the outer scalae share a common partition, an osseous surface called the spiral lamina. This surface is situated close to the modiolus, the spongy bone around which the scalae turn approximately two and a half times. The modiolus contains the spiral ganglion, which shall be mentioned again later on. Reissner's membrane separates the scala vestibuli from the scala media, whereas the basilar membrane divides the scala media from the scala tympani. The basilar membrane notably serves as the surface on which lies the organ of Corti, which contains the auditory transducers called « hair cells ». The scalae contain fluids called the perilymph (outer scalae) and the endolymph (scala media). The two outer scalae meet at the apex of the cochlea in an opening called the helicotrema, which allows their perilymph to connect. The scala media is a closed cavity whose endolymph does not directly interact with the exterior.

Figure 5. Cross section of the cochlea, providing a good view of the three scalae as well as a detailed view of the contents of the scala media. Retrieved from: see image.

8/13/2019 Mémoire Romain Boonen - Three-Dimensional Hearing.pdf

When the stapes' vibrations are transmitted to the oval window, they produce a displacement of the fluids within the scala vestibuli, which is transmitted to the scala tympani through the helicotrema. This phenomenon displaces the basilar membrane in a wave-like movement, along with the organ of Corti that is attached to it in the scala media, effectively allowing the hair cells to be stimulated and to transmit electrical impulses on to the brain.

THE ORGAN OF CORTI

The hair cells of the organ of Corti number about 15,000 in each human ear and come in two kinds: the inner hair cells (IHC), arranged in one row on the modiolus side of the cochlea (i.e. toward the inside), and the outer hair cells (OHC), arranged in three to five rows (increasing toward the apex). The hair cells are found inside the reticular lamina. From each hair cell protrude the stereocilia², which are the part of the hair cell that acts as the initial sensory transducer. Stereocilia are long filaments whose stiffness allows them to stand on the lamina and act as levers in response to mechanical deflections. The longest of the OHCs' stereocilia are embedded in the undersurface of a gelatinous body called the tectorial membrane. Being attached on one side only (toward the modiolus) above the organ of Corti and the basilar membrane, the tectorial membrane creates a deflecting movement of the hair cells in response to movements of the basilar membrane. On inner hair cells, the stereocilia are arranged in three to five nearly straight rows, while on outer hair cells they are arranged in three to five V-shaped rows.


When sound waves reach the eardrum, its vibrations are transmitted to the oval window through the ossicles, the last of which is the stapes. When vibrating, the membrane of the oval window initiates a wave of movement in the cochlear fluids, which is transmitted through the fluids to the round window. This phenomenon causes the cochlear partition (i.e. the basilar membrane and the organ of Corti) to move according to this transmitted wave's position and patterns, effectively revealing the frequency content of the stimulus to the brain once the hair cells are stimulated and the information is sent over to the ascending pathway.

G. von Békésy pioneered cochlear research, and a lot of the current knowledge on this matter is owed to his studies and experiments described in Békésy, 1960. He analysed the movements of the cochlear partition in human cadavers, was able to plot the travelling wave patterns and drew conclusions from them. As can be seen on the scheme, the amplitude of movement of the cochlear partition is contained within an amplitude envelope, never exceeding it. Some of Békésy's important findings can be summarized as follows:

(1)  As we have seen, vibrations from the stapes at any frequency allow for a

specific travelling wave to be initiated within the cochlear fluids. The travelling

wave’s pattern and its peak location in the cochlear duct depend on the

frequency of the stimuli brought by the stapes.


² It is important to mention that the specialized literature uses both « stereocilia » and « hair cell » when speaking of the stereocilia. The intended meaning can be understood from the context: either the whole cell (the stereocilia plus the part contained in the reticular lamina), or only the actual stereocilia.


 peak regions of the travelling waves of frequencies of 10 kHz and higher. As a side note, this feature seems to vanish after death.

Let us go back to this question of the physical limitation of the cochlea's amplitude of movement. In other words, why is it that, physically, the cochlear partition cannot be stimulated in a linear way? It is due to the combined action of two rather simple phenomena that can incidentally be found repeatedly in Nature: the stiffness and mass limitations. In order to explain them, let us refer back to the cochlear amplitude envelope described by Békésy. The first phenomenon, referred to as the stiffness limitation, explains why the cochlear partition cannot be fully stimulated from the base up to the resonance point situated closer to the apex. The cochlear partition was actually shown to be relatively rigid near the base, gradually becoming more compliant toward the apex. This stiffness is the main factor preventing the cochlear partition from moving freely according to the way it is stimulated. The second phenomenon, called the mass limitation, explains why the cochlear partition's amplitude potential rapidly decreases from that resonance point on to the apex: although the cochlear partition is now more compliant than it was at the base, its larger mass and inertia limit its amplitude of movement.



The two outer scalae, the scala vestibuli and the scala tympani, contain

 perilymph whereas the scala media contains endolymph. Chemically, those two

extracellular fluids are quite different from each other. Let us give them a bit of a

closer look.

Contained in the outer scalae, the perilymph is very similar to most other extracellular fluids in that it is mainly composed of sodium cations (Na+) and, to a much lesser extent, potassium cations (K+). Its electric potential is positive, and was reported by Johnstone and Sellick (1972) to be +7 mV in the scala tympani and +5 mV in the scala vestibuli, i.e. close to ground potential.

The endolymph is contained in the scala media. In contrast to the perilymph, it is mainly composed of K+ and, to a lesser extent, Na+. Endolymph is a unique kind of extracellular fluid for two reasons: (1) given its general composition, endolymph is very comparable to intracellular fluids and (2) its very high positive potential (referred to as the endocochlear potential and declining from +100 mV at the base to +83 mV at the apex) has not been found in any other extracellular fluid. Its chemical and electrical uniqueness therefore points to a very specific role within the cochlea. Indeed, according to the investigations that have been undertaken, these characteristics of the endolymph were found to play important roles in mechanotransduction as well as in the mechanical amplification of the travelling waves propagating in the cochlea.

Once the hair cells’ deflections have produced electrochemical impulses travelling through the auditory nerve fibres in the spiral ganglion, those impulses are sent through several parallel pathways that shall be introduced in this section. This mode of operation allows the brain to simultaneously extract multiple features of the stimulus, which prove to be of great importance for creating a representation of the so-called « auditory object ». For example, sound localization on the horizontal plane relies mainly on interaural time and level differences (respectively ITD and ILD), while sound localization on the vertical plane notably requires the complex analysis of the stimuli’s spectra. However, the proper analysis of spectral information prevents a reliable analysis of time information, which shows the need for parallel pathways of stimulus analysis.

This section is meant to give an idea of the different pathways used by the brain to interpret the electrochemical stimuli received from the hair cells, and to present the important brain areas where the information is processed, mentioning their cell compositions and their purposes. It is well worth mentioning that it is difficult to explain each section’s functions individually because the auditory nervous system is organized hierarchically. Indeed, the information analysed in the lower stages of the process is sent over to higher stages, which analyse the data and pass on to the next stages only the information that is relevant at the time, so that the auditory cortex can eventually represent the auditory object we hear. The « resolution of representation » of this auditory object thus increases as the different pieces of information analysed in the lower stages are put together and made sense of in the higher stages.

For reference, a very simplified plan of the different stages of the ascending pathways would go as follows:

(1)  After hair cell deflections in the left cochlea, the electrochemical impulses are transmitted to the left cochlear nerve (auditory nerve) situated in the modiolus.

(2)  The output fibres of the cochlear nerve branch. One end enters the left ventral cochlear nucleus (VCN), while the second end enters the left dorsal cochlear nucleus (DCN).

(3)  The outputs of the left VCN enter both the left and right superior olivary complexes (SOC). This fibre pattern is referred to as the trapezoid body; the left DCN outputs directly enter the right lateral lemniscus nucleus (LLN).

(4)  The outputs of the left and right SOC respectively enter the left and right LLN.

(5)  From that point on, the left and right parts of the brain no longer connect contralaterally (with the other side of the brain). The left LLN connects with the left inferior colliculus (IC).

(6)  The left IC connects to the left medial geniculate body (MGB) in the thalamus.

(7)  The left MGB connects with the auditory cortex. Both parts are able to communicate back and forth.
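The simplified plan above can be written down as a small directed graph, which makes it easy to see by which routes a left-ear stimulus reaches each auditory cortex. This is only a sketch of the list as stated: the list describes steps (5) to (7) for the left side only, so the mirrored right-side connections are an assumption, and the function name `find_routes` is purely illustrative.

```python
# Edges follow the numbered steps for a stimulus entering the LEFT cochlea.
# The right-side LLN -> IC -> MGB -> cortex chain is assumed to mirror the left one.
ASCENDING_PATHWAY = {
    "left cochlea": ["left cochlear nerve"],
    "left cochlear nerve": ["left VCN", "left DCN"],
    "left VCN": ["left SOC", "right SOC"],   # trapezoid body
    "left DCN": ["right LLN"],
    "left SOC": ["left LLN"],
    "right SOC": ["right LLN"],
    "left LLN": ["left IC"],
    "right LLN": ["right IC"],
    "left IC": ["left MGB"],
    "right IC": ["right MGB"],
    "left MGB": ["left auditory cortex"],
    "right MGB": ["right auditory cortex"],
}

def find_routes(start, goal, graph=ASCENDING_PATHWAY):
    """Enumerate every route from start to goal (the graph has no cycles)."""
    if start == goal:
        return [[start]]
    routes = []
    for next_stage in graph.get(start, []):
        for tail in find_routes(next_stage, goal, graph):
            routes.append([start] + tail)
    return routes

if __name__ == "__main__":
    for cortex in ("left auditory cortex", "right auditory cortex"):
        for route in find_routes("left cochlea", cortex):
            print(" -> ".join(route))
```

Under these assumptions the left cochlea reaches the left cortex through the VCN and left SOC only, and the right cortex through two routes (VCN via the right SOC, and DCN directly to the right LLN).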


Figure 6. Representation of the ascending pathways of the central auditory nervous system.

Retrieved from: http://origin-ars.els-cdn.com/content/image/1-s2.0-S1527336908001347-gr3.jpg

in September 2013

The auditory nerve is situated in the modiolus, on the inner side of the cochlea. Its afferent fibres are situated at the base of the hair cells, transporting the electrical impulses from the cells to the auditory nerve and then on to the brainstem. The efferent fibres are located in roughly the same places, and they allow the brainstem to influence the cochlea. Both the efferent and afferent fibres lead to the spiral ganglion in the modiolus.

Inner hair cells and outer hair cells are innervated in completely different ways. Briefly, there are two types of afferent fibres: Type I (also called radial fibres, comprising 90 to 95% of them) and Type II (also referred to as outer spiral fibres, comprising the remainder of the afferent fibres). Each inner hair cell receives about 20 to 30 Type I fibres (according to Liberman et al., 1990) whereas each outer hair cell receives about six Type II fibres. Every Type I fibre is connected to only one hair cell, but Type II fibres branch and end up innervating about ten outer hair cells. However, outer hair cells are not only connected to those few Type II fibres (few compared to the innervation of inner hair cells by Type I fibres); they are also linked to other synapses coming from different afferent fibres.


The cochlear nerve sends information to two entities:

THE VENTRAL COCHLEAR NUCLEUS

Because it is specialized in the analysis of time and intensity information, the ventral cochlear nucleus contributes mainly to the pathway of binaural localization (on the horizontal plane). It also contributes to the pathway of sound identification. The ventral cochlear nucleus is itself composed of two areas:

•  The Anteroventral Cochlear Nucleus (AVCN): 

The AVCN contains a type of cell called « bushy cells » (named for the bushy patterns of their dendrites), known for their ability to rapidly and reliably transmit the impulses they receive to the next stage. There are spherical and globular bushy cells.

Spherical bushy cells transmit the stimulus’ time of arrival to the superior olivary complex. There, this time information will be compared to the time of arrival coming from the other ear. Globular bushy cells, on the other hand, handle intensity information. Just like the spherical bushy cells, globular bushy cells send this information to the superior olivary complex, where the intensity information from both ears is compared.

The AVCN is thus responsible for sending information to the higher stages that will turn out to be very useful for binaural sound localization in the horizontal plane.

•  The Posteroventral Cochlear Nucleus (PVCN): 

The PVCN’s structure is slightly more complex than the AVCN’s in the sense that it comprises four types of cells: globular bushy cells, octopus cells and two types of stellate cells (T-stellate and D-stellate, in proportions of 95% and 5% respectively).

Octopus cells are useful for two main reasons. Firstly, they have a pattern of response called the « onset response » because of their ability to fire very strongly at the onset of a new stimulus. Secondly, they have an extremely high resolution of response for transients in ongoing stimuli (they can detect more than 500 transients per second!). Moreover, their spectral range of action is very wide. Therefore, it is thought that octopus cells are specialized in the extraction of temporal fluctuations in complex broadband stimuli such as the human voice.

T-stellate cells fire repetitively when they receive stimuli corresponding to a

sustained tone burst. However, their firing rate is not related to the frequency of the

tone. They send this information to several different areas that are part of this

ascending pathway. D-stellate cells shall not be described here.

To summarize, the PVCN contributes to two pathways: binaural sound localization (on the horizontal plane specifically) as well as sound identification.

THE DORSAL COCHLEAR NUCLEUS

The dorsal cochlear nucleus gives great contributions to the pathways of sound

identification as well as of binaural sound localization (but on the vertical plane this

time). It is composed of three layers, but we shall only focus on the second, most

important, pyramidal cell layer.

The pyramidal cells (also called fusiform cells) project primarily to the contralateral inferior colliculus (i.e. on the other side of the brain) through the lateral lemniscus nucleus. Unfortunately, studying them is a complex endeavour because of their strong vulnerability to anaesthesia. However, we do know that their response patterns contribute to the sound identification pathway. Since this work focuses on the localization of sound, those responses will not be presented and we shall focus a bit more on the pyramidal cells’ contributions to the binaural localization pathway.

It is known that notches in the spectral content of the stimuli can strongly drive the pyramidal cells if the frequency of the notch is close to the frequency to which the cell is tuned. Those notches are produced by the pinna and their frequencies are strongly influenced by the elevation angle of the sound source. Evidence for this explanation was found in cats with lesions in those parts of the brainstem, which could no longer make reflex orientations of their heads upwards towards the position of the sound source (Sutherland et al. 1998a, b and 2000). Indeed, it is thought that pyramidal cells play an important role in this unlearned action, as it was still possible for the cats to learn to discriminate between sound sources at different elevations using a behavioural conditioning task. Therefore, the dorsal cochlear nucleus certainly plays a role in the binaural localization of sound sources on the vertical plane, but it is probably not the only structure involved.


Two trends can be distinguished in the output streams coming out of the cochlear nuclei: the binaural localization pathway is served by the ventral stream, which is itself divided into one section relaying intensity information and a second one relaying time information, and the identification pathway is served by the dorsal stream.

The dorsal stream is sent directly to the inferior colliculus (through the lateral lemniscus), while the ventral streams enter the superior olivary complexes on both sides of the brain. The first one, conveying intensity information, enters the lateral superior olive (LSO) along with the same stream coming from the other ear, where the intensity information conveyed in the streams of both ears will be compared. In much the same way, the time information from both ears will reach both medial superior olives (MSO), one on each side of the brain, where timing information will be compared. The seminal Jeffress model (Jeffress, 1948) suggests an explanation for this correlation process and will be presented in a further section.

The LSO contains cells of the « IE » type. In this terminology, the first letter represents the response to the contralateral ear (I = Inhibitory⁴) and the second letter represents the response to the ipsilateral one (E = Excitatory⁵). To exemplify this concept, rough trends can be given as follows. Firstly, a tone is presented to the ipsilateral, excitatory ear alone: the IE cells’ firing rate is maximal. Then a tone is introduced to the contralateral, inhibitory ear. As the intensity of that second tone is raised, the firing rate decreases until it reaches a value close to zero when the contralateral tone intensity equals the intensity of the ipsilateral one. As will be explained later on in this work, ILDs are mostly relevant for high frequencies. That is the reason why the LSO is mostly responsive to high frequencies.

On the other hand (as previously mentioned), the MSO receives streams coming from the bushy cells of the AVCN on both sides, conveying timing information. Thanks to the spherical bushy cells’ ability to fire almost instantly, the nucleus is able to compare both ears’ times of arrival very reliably and thus retrieve valuable information for binaural localization on the horizontal plane. How does it work? The MSO has a very thin, sheet-like structure and is composed of a single layer of fusiform cells, most sensitive to low frequencies. Although we will not develop this topic much further, we can simplify and say that the timing information is compared thanks to the fact that each fusiform cell is tuned to fire maximally at a given, characteristic delay between both times of arrival. The processing of horizontal-plane localization information in the higher stages of the afferent pathways thus depends on how strongly each fusiform cell fires.

It should also be mentioned that a minority of EE cells is contained in the LSO, allowing it to analyze not only intensity information (its specialty) but timing information as well. Similarly, a minority of IE cells is contained in the MSO, allowing it to process not only timing information but also intensity information.
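The rough trend given for the IE cells of the LSO (maximal firing when only the ipsilateral ear is stimulated, falling towards zero as the contralateral level approaches the ipsilateral one) can be sketched as a toy function. The linear shape, the default maximum rate and the function name `ie_firing_rate` are all assumptions made for illustration; real cells follow more complex, saturating curves.

```python
def ie_firing_rate(ipsi_db, contra_db, max_rate=100.0):
    """Toy firing rate (spikes/s) of an IE cell.

    ipsi_db   -- level at the excitatory (ipsilateral) ear, in dB
    contra_db -- level at the inhibitory (contralateral) ear, in dB
    The linear fall-off and max_rate are hypothetical choices.
    """
    if ipsi_db <= 0:
        return 0.0  # no excitation, no firing in this toy model
    # Fraction of the excitation that is not cancelled by inhibition.
    drive = (ipsi_db - contra_db) / ipsi_db
    return max_rate * min(1.0, max(0.0, drive))

if __name__ == "__main__":
    for contra in (0, 15, 30, 45, 60):
        print("contralateral", contra, "dB ->", ie_firing_rate(60, contra), "spikes/s")
```

In this sketch the output therefore encodes the level difference between the two ears rather than either level alone, which is the property that makes such cells useful for ILD analysis.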


The lateral lemniscus is a tract through which the ascending pathways run from the superior olivary complex to the inferior colliculus. Two major nuclei are contained within the tract, known as the ventral and dorsal nuclei of the lateral lemniscus (respectively VNLL and DNLL). Although a majority of fibres are connected to one of the nuclei, some of them simply run through the tract, entering the inferior colliculus.


The VNLL is part of the monaural sound identification stream⁶, receiving its inputs from axons of the contralateral ventral cochlear nucleus as well as from other nuclei that are not mentioned in this work for the sake of simplicity. Since it does not

4 Definition of inhibitory: « slow down or prevent (a process, reaction, or function) or reduce the activity of (an enzyme or other agent). »

5 Definition of excitatory: « characterized by, causing, or constituting excitation. »

6 The stream that deals with the identification of sounds retrieved by a single ear, as opposed to binaural information that would have been previously interpreted in the superior olivary complex.


receive inputs from either the MSO or the LSO, it does not appear to play any role in the binaural localization pathway. It projects ipsilaterally to the inferior colliculus in a complex pattern. Although experts are still unsure about the actual role of the VNLL, Langner (2005) speculated that it could be able to extract harmonic relations between stimuli.

The DNLL is part of the binaural sound localization pathway, receiving outputs from the ipsilateral MSO, the LSOs from both sides and the contralateral cochlear nucleus. Its role is mainly inhibitory, which eventually enhances the lateralization of sound sources that was previously created in the lower stages of the ascending pathway. It is interesting to note that, due to the lasting effect of its inhibitory projections (which will not be explained here), it also enhances lateralization by suppressing echoes in echoic environments (Pecka et al., 2007).


The inferior colliculus could be seen as the most important « data center » of the lower parts of the brainstem. Here, the vast majority of the information previously processed in the different pathways (mainly sound localization, sound identification, and their respective « sub-pathways ») converges, and the image of the auditory object that will eventually be perceived in the auditory cortex starts to be strongly refined. Of course, this paramount stage of synthesis of basic elements entails a whole new level of complexity. Situated close to the superior colliculus (which is itself the important integrative reflex center of the visual nervous system), the inferior colliculus is composed of three divisions: the central nucleus, the external nucleus and the dorsal cortex. The central nucleus (ICC) is innervated mainly by fibres running through the lateral lemniscus, while the external nucleus and dorsal cortex receive fibres that do not run through the tract. Those two are in charge of processing information surrounding the auditory system, only indirectly bringing an improvement to the eventual auditory object. Instead, the « extra-lemniscal » pathway (as it is referred to) also comprises multisensory stimuli.

The ICC receives information from all four sources of binaural localization: the LSO (center of analysis of intensity differences), the MSO (center of analysis of timing differences), the DNLL (responding to both cues) and the DCN (dorsal cochlear nucleus, retrieving the information needed for the proper localization of sounds on the vertical plane).

The ICC is tonotopically organized in laminae (thin layers of organic tissue). In other words, all of the fibres carrying different information related to a common characteristic frequency meet on the same layer of the ICC. Studies carried out in recent years have been able to suggest some of the interactions of the four sources of binaural localization in the IC. Indeed, Loftus et al. (2004) showed that the low-frequency laminae (where ITDs dominate) receive inputs from the ipsilateral MSO (processing of ITDs) but also, interestingly, from the ipsilateral LSO (processing of ILDs). On the other hand, the high-frequency laminae were shown to receive inputs mainly from the DCN (which, as a reminder, processes the high-frequency notches used in localization on the vertical plane) and, of course, the LSO.

It is worth mentioning that, thanks to recent anatomical evidence, researchers have grown to believe in the existence of further maps of information processing (apart from the spectral one). Indeed, the laminae are two-dimensional and the spectral organization covers only one axis. Some have suggested that the second dimension of the laminae would be home to a map of periodicity detection, but this claim has yet to be proven. Another option (that shall be investigated further on) points to a topological map dealing with the phase correlation of stimuli in the process of creating the perceived auditory space.

The external nucleus receives inputs from the contralateral cochlear nucleus (including its DCN), the ICC, the auditory cortex (on a descending pathway), as well as somatosensory⁷ input from the dorsal columns and the trigeminal nuclei. The dorsal cortex receives information from the contralateral inferior colliculus as well as descending inputs from the auditory cortex. Although the roles of those two nuclei are not known for sure, many have suggested that the nature of the input received in the external nucleus points to an auditory and somatosensory integrative area that launches the reflexes triggered by certain sounds. This feature of the auditory system is part of the so-called “diffuse” or “extra-lemniscal” system that was previously mentioned.


The medial geniculate body is the last auditory relay before the stimuli enter the auditory cortex and, within the scope of descending pathways, acts as an intermediary between the auditory cortex and the rest of the subcortical nuclei. Moreover, those ascending and descending connections suggest that the medial geniculate body and the auditory cortex can be grouped as a functional unit.

Although it is divided into three units, the medial geniculate body has only one section that seems to be involved in the lemniscal auditory pathway: the ventral section. We will not focus much on the other two, less specific areas of the medial geniculate body. The ventral section mainly collects information from the ICC (discussed just above). Similarly to the ICC, the ventral section of the medial geniculate body is tonotopically organized in a laminar structure, and it has been suggested that a further functional organization underlies each specific range of frequencies (the functional groups were termed “slabs”). The purpose of this ventral section is said to be a further sharpening of frequency resolution.

The other two sections of the medial geniculate body are the medial and dorsal divisions. They are part of the extra-lemniscal pathway and receive visual as well as somatosensory information; it is worth mentioning that their responses can change as a result of learning.


The auditory cortex is the functional unit where all of the previously gathered information will be assembled in order to form an auditory object in the listener’s mind. As the very large complexity of this unit barely allows us to scratch its surface within the scope of this work, the main information will be covered here less precisely and more abstractly.

The auditory cortex consists of a core unit, surrounded by a belt and a para-belt. The core unit, mainly receiving inputs from the specific, lemniscal system, is itself composed of three main sections: the primary receiving area (AI), a secondary area (AII) and further “association” areas. Although the information integration processes in the auditory cortex are the same for every human, the actual neural responses will be


7 Definition from Oxford American Dictionaries: relating to or denoting a sensation (such as pressure, pain, or warmth) that can occur anywhere in the body, in contrast to one localized at a sense organ (such as sight, balance, or taste).


notably a function of the listener’s genetics and previous exposure to the stimulus. Moreover, more activity is detected for stimuli of current significance to the listener in his/her environment. Once the information is analyzed by the core unit, it continues on to the belt and para-belt for further examination.

If we summarize the stimulus’ course along the brainstem (from the auditory nerve to the actual representation of the auditory object) in terms of “what” (sound identification) and “where” (sound localization) pathways, we notice that those two streams first separate right before entering the cochlear nuclei (for better analysis of the cues involved), then progressively reunite in the lemniscal tract up to and including AI. There, as mentioned, the association of spatial and identity-specific information drives the neural activity in a unique way, effectively representing the auditory object to the listener. Indeed, different objects are represented through different (though overlapping) patterns of neural activity in the auditory cortex. We hear! Coincidences with previously experienced patterns of neural activity facilitate the integration of known stimuli. Then, once again, the “where” and “what” streams segregate into discrete pathways: on one hand, both the identity and localization of the sound are transmitted to a further dorsal pathway in the brain (making it possible to prepare a potential motor response), forming a “where” or “do” stream, while a “what” stream continues on a ventral pathway to several different parts of the brain.


As discussed in the preceding section, the process used by the brain to create a sense of auditory space relies on several different features according to the type of signal presented to the ears. On one hand, localization on the horizontal plane relies on ITDs and ILDs, effectively using the superior olivary complex’s ability to make sense of the interaural correlation. On the other hand, localization on the vertical plane as well as the judgement of the distance to the sound source mainly rely on the analysis of the spectral content of those sound sources. As a reminder, this information is analyzed in the dorsal cochlear nucleus. Therefore, it would be correct to summarize a little and write that localization on the horizontal plane mainly uses signals in the time domain, while the vertical plane as well as the judgement of distance to the sound source use frequency domain signals. This section aims at briefly presenting those



Out of the three « dimensions » (horizontal, vertical and distance), localization on the horizontal plane is the best understood. The early findings of Lord Rayleigh in his Duplex Theory (1907) arguably form the core of our knowledge of binaural hearing. He was the first to give an explanation for the ILD and ITD phenomena, which contribute to intracranial images associated with the « lateralization » of the sound source, i.e. movement of the sound source to the left or right of the listener. ILDs arise because of the physical dimensions of an incoming sound: very simply, the high-frequency content of sounds coming from the opposite side of the head is reflected by the head, creating an acoustic shadow. This reflection of high frequencies diminishes the energy content of the sound reaching the contralateral ear, thus creating a level difference with the signal reaching the ear on the same side as the sound source. As we now know, those ILDs are correlated in the lateral superior olives.
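The role played by the « physical dimensions » of a sound can be made concrete by comparing its wavelength with the size of the head. The sketch below is a rule-of-thumb illustration only: the speed of sound (343 m/s) and the head diameter (0.18 m) are assumed round values, and treating "wavelength shorter than the head" as the threshold for an acoustic shadow is a simplification.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumed)
HEAD_DIAMETER = 0.18    # m, assumed typical adult head width

def wavelength_m(freq_hz):
    """Wavelength in metres of a tone of the given frequency."""
    return SPEED_OF_SOUND / freq_hz

def head_casts_shadow(freq_hz):
    """Rule of thumb: the head shadows sounds whose wavelength is shorter than it."""
    return wavelength_m(freq_hz) < HEAD_DIAMETER

if __name__ == "__main__":
    for f in (500.0, 1500.0, 5000.0):
        print(f, "Hz:", round(wavelength_m(f), 3), "m, shadowed:", head_casts_shadow(f))
```

With these assumed values, low-frequency waves are much longer than the head and bend around it, whereas waves of a few kilohertz and above are short enough to be blocked, which is consistent with ILDs being mostly a high-frequency cue.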


On the other hand, ITD processing relies on a model that has been held as the reference for over 60 years: the Jeffress model, first introduced in 1948. It aims to explain how time information in binaural signals is correlated in the bilateral medial superior olives. It consists of an array of neurons serving as coincidence detectors, firing maximally when reached simultaneously by stimuli from both ears. Those detectors are innervated by axons of variable length that effectively create a system of delay lines, allowing one stimulus to reach all of the detectors but at different times (as can be seen in Figure 7). This arrangement creates a topographic map of ITDs since, when a given detector is reached simultaneously by the stimuli from both ears, its firing corresponds to a given spatial position of the sound source. Interestingly, Stevens & Newman (1936) reported that human subjects made the fewest azimuth localization errors for frequencies below 1.5 kHz and above 5 kHz, indicating not only that the brain must use two localization mechanisms (respectively ITDs under 1.5 kHz and ILDs above 5 kHz, thus backing up Rayleigh’s work in the process), but also that the confusion reported between 1.5 kHz and 5 kHz suggests that those mechanisms act simultaneously within that band. Those results prove to be quite consistent with physiological reports made later on: the phase-locking of the stimuli in the auditory nerve declines for frequencies above 3 kHz and is reduced to practically nothing around 4 or 5 kHz, and the medial superior olive, discussed earlier within the scope of the ITD ascending pathway, contains more neurones with low best frequencies, while the lateral superior olive (ILD pathway) contains more neurones with high best frequencies. Regarding phase-locking, it is interesting to note that neurones in the MSO are actually not sensitive to time differences per se but rather rely on interaural phase differences between the two ears’ inputs (McAlpine, 2005).

As useful and intuitive as it is, the Jeffress model seems to be nothing but a model. Indeed, the researchers who set out to find anatomical evidence of the delay lines concept presented by Jeffress (Smith et al. 1993; Beckius et al., 1999) never found anything convincing enough to validate the model as factual. However, ethological evidence (in the barn owl, whose hearing is incredibly developed and subject to extensive research) has encouraged many to believe that those interaural time correlations are actually the result of topological maps in which specialized neurones are tuned to fire maximally at a given phase (without the delay lines, that is), effectively giving the auditory object its azimuth. It is suggested that such a map would be placed orthogonal or parallel to the tonotopic map that was previously discussed in the section dealing with the central nucleus of the inferior colliculus. However, some observations seem to contradict this claim and it is an ongoing discussion among experts.
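Whatever its anatomical validity, the delay-line and coincidence-detector scheme attributed to Jeffress is easy to express computationally: each detector applies its own internal delay to one ear's signal and multiplies it with the other ear's signal, and the detector whose internal delay cancels the ITD responds most strongly. The following is a minimal sketch under stated assumptions: signals are sampled lists, delays are whole samples, the noise burst is synthetic, and the helper names (`delayed`, `coincidence_outputs`, `estimate_itd_samples`) are hypothetical.

```python
import random

def delayed(signal, delay):
    """Return the signal delayed by `delay` >= 0 samples (zero-padded, same length)."""
    return [0.0] * delay + signal[:len(signal) - delay]

def coincidence_outputs(left, right, max_delay):
    """Response of one detector per internal delay k, in samples.

    k > 0: the left input travels through a longer delay line;
    k < 0: the right input does. The response is the summed product,
    which is largest when the two delayed inputs coincide.
    """
    outputs = {}
    for k in range(-max_delay, max_delay + 1):
        l = delayed(left, max(k, 0))
        r = delayed(right, max(-k, 0))
        outputs[k] = sum(a * b for a, b in zip(l, r))
    return outputs

def estimate_itd_samples(left, right, max_delay=32):
    """Internal delay of the most active detector (positive: left ear leads)."""
    outputs = coincidence_outputs(left, right, max_delay)
    return max(outputs, key=outputs.get)

if __name__ == "__main__":
    random.seed(0)
    burst = [random.gauss(0.0, 1.0) for _ in range(2000)]
    left, right = burst, delayed(burst, 7)   # sound reaches the left ear first
    print("most active detector:", estimate_itd_samples(left, right))
```

The position of the most active detector in the array plays the role of the topographic map of ITDs. At an assumed sampling rate of 48 kHz, 32 samples correspond to roughly 0.67 ms, which is in the range usually quoted for the largest ITDs of a human head.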

Figure 7. Representation of the model presented by Jeffress (1948). From McAlpine (2005).

As a side note, it has been stated that the brain relies on the correlation of ILDs and ITDs to assess the lateralization of a sound source. But what is meant by that? Interaural correlation actually refers to how similar or dissimilar the signals of a given sound source reaching the left and right ears are. Two equations are often encountered in the specialized literature, both yielding what is referred to as an index of correlation. They are formally known as the normalized covariance and the normalized correlation. Basically, two signals whose analyzed features are perfectly similar have a correlation index of 1.0. However, it is not my intention to burden this paper with mathematically complex computational models of interaural correlation, so I shall not dig any deeper. Other kinds of computational models for the binaural processing of 3D audio content will follow soon enough in the Machine chapters.
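For readers who would still like to see the two indices, here is a minimal sketch, assuming they take their usual textbook forms: the normalized correlation divides the summed product of the two ear signals by the square root of the product of their energies, and the normalized covariance does the same after removing each signal's mean. The function names are illustrative and the test signals are synthetic.

```python
import math

def normalized_correlation(left, right):
    """Sum of products divided by the geometric mean of the two energies."""
    numerator = sum(l * r for l, r in zip(left, right))
    denominator = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return numerator / denominator

def normalized_covariance(left, right):
    """Same index, computed after subtracting the mean of each signal."""
    mean_left = sum(left) / len(left)
    mean_right = sum(right) / len(right)
    return normalized_correlation(
        [l - mean_left for l in left],
        [r - mean_right for r in right],
    )

if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * 5.0 * n / 100.0) for n in range(100)]
    inverted = [-v for v in tone]
    print(normalized_correlation(tone, tone))      # identical signals
    print(normalized_covariance(tone, inverted))   # polarity-inverted signals
```

Identical signals give an index of 1.0 and polarity-inverted signals give -1.0. The two indices differ only when the signals carry a constant offset, which the covariance ignores and the correlation does not.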

Let us return to Lord Rayleigh’s findings for a moment. Apart from the idea of the acoustic shadow formed by the head on the side opposite to the sound source position, which is relevant at high frequencies, Rayleigh also introduced the concept of the « cone of confusion ». The prestigious Oxford Reference website defines this cone of confusion as follows:

« A cone-shaped set of points, radiating outwards from a location

midway between an organism's ears, from which a sound source

 produces identical phase delays and transient disparities, making the

use of such binaural cues useless for sound localization. Any cross-

section of the cone represents a set of points that are equidistant from

the left ear and equidistant from the right ear. »8 

The cone of confusion is thus the cone that could be drawn around a listener’s ear and that contains points whose ILD and ITD values are identical, such that the listener could be confused as to the actual position of the sound source.
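The geometry behind the cone of confusion can be checked numerically with a very reduced model: two point-like ears on a left-right axis and no head between them. Everything here is an assumption made for illustration (ear spacing, speed of sound, source distance, function names), and only the ITD is computed; in this model every point on a cone around the interaural axis yields the same ITD, and so do a frontal source and its mirror image behind the listener.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed
EAR_OFFSET = 0.09       # m, assumed half-distance between the two ears

def itd_seconds(source):
    """ITD for two point ears on the x axis; positive when the right ear is nearer."""
    left_ear = (-EAR_OFFSET, 0.0, 0.0)
    right_ear = (EAR_OFFSET, 0.0, 0.0)
    return (math.dist(source, left_ear) - math.dist(source, right_ear)) / SPEED_OF_SOUND

def horizontal_source(azimuth_deg, distance=2.0):
    """Source in the horizontal plane: 0 deg = straight ahead (+y), +90 deg = right (+x)."""
    a = math.radians(azimuth_deg)
    return (distance * math.sin(a), distance * math.cos(a), 0.0)

def cone_source(cone_angle_deg, roll_deg, distance=2.0):
    """Point on a cone around the interaural (x) axis, at a given roll around that axis."""
    t = math.radians(cone_angle_deg)
    p = math.radians(roll_deg)
    return (distance * math.cos(t),
            distance * math.sin(t) * math.cos(p),
            distance * math.sin(t) * math.sin(p))

if __name__ == "__main__":
    print("front +20 deg :", itd_seconds(horizontal_source(20.0)))
    print("back +160 deg :", itd_seconds(horizontal_source(160.0)))
    print("cone, 4 rolls :", [itd_seconds(cone_source(70.0, r)) for r in (0, 90, 180, 270)])
```

The same model also hints at why head rotation helps: turning the head 10° to the right moves a source from +20° to +10° (the ITD shrinks) but moves its mirror image from +160° to +150° (the ITD grows), so the direction of the change tells front from back.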

A related psychoacoustical phenomenon that has left a number of binaural simulation experts wondering is the front-back confusion.⁹ What does it consist of? Quite simply, front-back confusions consist in the listener’s inability to decide whether the sound source emanates from in front of or behind him/her, or rather in localizing a sound in front when it emanates from behind and vice versa. They are thought to be mostly produced by confusing ITD values for sound sources belonging to the cone of confusion we just discussed. Indeed, for any azimuth in front, the same ITD value exists for a sound source placed at the back. However, the occurrence of such confusions can be greatly diminished when the listener is able to rotate his/her head. Indeed, this new dynamic cue modifies the perceived ITD and ILD values, helping the listener’s brain make a more informed decision as to where in space it should place the incoming stimulus. For example, if a sound source is presented at an azimuth of +20° on the center-right of a listener and a front-back confusion occurs, with the listener localizing the sound source at +160°, then slightly turning his/her head towards the right will decrease the perceived ITD and ILD values, effectively allowing the listener to localize the sound source at its actual position. However, as suggested in Wallach (1940) and shown in Wightman and Kistler (1999), the actual movement of the head is not necessary for diminishing front-back confusion. For example, if the listener is placed on a rotating platform while receiving stimuli from a static sound source, the listener does not need

8 Definition retrieved from:

http://www.oxfordreference.com/view/10.1093/oi/authority.20110810104643902 on Sept. 17, 2013.

9 It is worth mentioning that such confusions mainly happen in experimental conditions and rarely under « normal » everyday conditions. However, they are mentioned here because they will prove to be of great importance in the binaural virtualization of content discussed in the Machine section.

Figure 8. Representation of Rayleigh's cone of

confusion. © 2007 howstuffwork.


to rotate his/her head in order to extract the relevant information provided by the dynamic cue, as long as he/she is aware of the direction of his/her relative movement. The correct positioning of sound sources situated behind the listener’s head also depends on another, stationary factor that is only enhanced by the dynamic one we just discussed. When a sound source is situated behind the listener, its high-frequency content is not able to diffract around the pinna, resulting in a form of low-pass filtering. As suggested by Wightman and Kistler (1997a), front-back differences are mostly indicated by level differences in the 4-6 kHz region.
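The front-back ambiguity and the role of head rotation described above can be illustrated with a deliberately simplified model. The sketch below assumes two point receivers with no head between them, so that the ITD is simply proportional to the sine of the azimuth; the ear spacing, the speed of sound and the function name are illustrative assumptions and not values taken from this study.

```python
import math

EAR_SPACING_M = 0.18     # assumed distance between the two ears
SPEED_OF_SOUND = 343.0   # assumed speed of sound in air, in m/s


def itd_seconds(azimuth_deg, head_yaw_deg=0.0):
    """ITD of a distant source for a two-receiver model without head shadow.

    Azimuth is measured clockwise from straight ahead; head_yaw_deg is the
    rotation of the head towards the right.
    """
    relative = math.radians(azimuth_deg - head_yaw_deg)
    return EAR_SPACING_M / SPEED_OF_SOUND * math.sin(relative)


front, back = 20.0, 160.0

# Static head: both positions yield the same ITD, hence the confusion.
same_itd = math.isclose(itd_seconds(front), itd_seconds(back))

# Head turned 10 degrees to the right: the ITD shrinks for the frontal
# source but grows for the rear one.
front_change = itd_seconds(front, 10.0) - itd_seconds(front)
back_change = itd_seconds(back, 10.0) - itd_seconds(back)
```

In this simplified model the two changes have opposite signs, which is the information carried by the dynamic cue.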


A good starting point in the discussion of localization on the vertical plane is to look at the results obtained in Butler and Humanski (1991). Listeners were seated by a vertical arch of seven loudspeakers, fixed to the beam from 0° to 90° in increments of 15°. The testing was organized under six different conditions. In Conditions 1 and 2, the listeners were presented with 3 kHz low-pass and then high-pass noise bursts respectively, originating in the LVP (lateral vertical plane), and they were able to localize sounds binaurally, i.e. using both their ears. In Conditions 3 and 4, the same noises were presented binaurally but this time originating from the MVP (median vertical plane). Conditions 5 and 6 were similar to Conditions 1 and 2, except that the listeners’ localization abilities were tested monaurally, i.e. using one ear only.10 The researchers found that in Condition 1 (when listeners were presented with the low-pass noise in the LVP) they were very capable of localizing the sound sources. This result was to be expected given our previous discussion: the listeners relied on the availability of binaural information. However, the listeners performed poorly at assessing the sound sources’ elevation in the MVP (Condition 3) with the same low-pass noise. Indeed, no cue was available to relate to that elevation, as the pinnae’s filtering effects that could have provided the necessary information only appear at higher stimulus frequencies (Searle et al., 1975).

On the other hand, when the listeners were presented with the high-pass noise bursts (in Conditions 2 and 4), they performed substantially better, especially in the MVP. It therefore seems clear that localization on the vertical plane depends mostly on the pinna’s ability to shape the stimuli’s high-frequency content into peaks and notches (mostly between 4 kHz and 16 kHz; see notably Blauert, 1969) according to the sound sources’ elevation. This applies mostly to localization in the MVP, which is, as

10 Naturally, monaural testing makes it possible to isolate the cue related to high-frequency content from the ILDs and ITDs, which are binaural cues.

Figure 9. Apparent elevation of the sound sources plotted against their actual elevations. From Butler and Humanski (1992).


we shall see, a great area of potential improvement in binaural reproduction of 3D



The ability of a listener to estimate the distance to a sound source (without visual capture) depends on his/her ability to determine, mostly unconsciously, how the original signal has changed through the propagation process, according to three main factors: the relative intensity of the sound, the damping effect (i.e. the relative intensity of its high-frequency content) and the direct-to-reverberant energy ratio. This section briefly presents those three parameters. We will then discuss the influence of visual capture of the sound source on the estimation of its distance to the subject, as well as how the accuracy of this estimate directly

relates to the subject’s previous exposure to the perceived auditory object and room


The first factor, the relative intensity of a sound, does not come as a surprise, as it is well known that sound waves propagating in free space lose 6 dB in sound pressure level every time the distance to their source doubles. Therefore, judgments of distance increase systematically when the sound pressure reaching the eardrums decreases. This points toward an internal reference system that holds the expected intensity of a given auditory object and compares it to the actual occurrence. Placing this occurrence on our internal scale allows us to estimate the distance to the source of the given auditory object.
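The 6 dB figure quoted above follows from the inverse-distance law for sound pressure in a free field. A minimal sketch, assuming an ideal point source and no reflections (the function name is ours), could read:

```python
import math


def level_change_db(reference_distance, distance):
    """Change in sound pressure level when moving from reference_distance
    to distance, for an ideal point source in a free field."""
    return -20.0 * math.log10(distance / reference_distance)


drop_when_doubling = level_change_db(1.0, 2.0)     # close to -6 dB
drop_when_quadrupling = level_change_db(1.0, 4.0)  # close to -12 dB
```

Each doubling of the distance adds the same drop, which is why the level is a usable, if relative, distance cue.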

The second factor is closely related to the first one, but is to be considered distinct nonetheless. It deals with the damping effect, i.e. the loss of high-frequency energy as a function of distance due to atmospheric absorption. Coleman (1968) supported this notion by showing that a low-pass-filtered signal (with a gentle slope) was consistently localized further away from the subject than the same signal unaltered.

Finally, the third main cue used by listeners to assess the distance to a sound source is the ratio of the energies arriving along the direct path (i.e. the direct field) and the indirect paths (i.e. the diffuse field) to the receiver. This ratio can be called the « direct-to-reverberant » energy ratio. The higher the ratio, the closer the estimated sound source position, and vice versa.
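To make the direct-to-reverberant ratio more concrete, the sketch below estimates it from a room impulse response. It rests on an assumption that is ours and not part of this study: everything up to a short window after the strongest peak is counted as direct sound, and everything after it as reverberation. The window length and the function name are illustrative.

```python
import math


def direct_to_reverberant_db(impulse_response, sample_rate, direct_window_ms=2.5):
    """Estimate the direct-to-reverberant energy ratio in dB.

    Assumption: samples up to direct_window_ms after the strongest peak
    belong to the direct path; the remainder is reverberant energy.
    """
    peak = max(range(len(impulse_response)), key=lambda i: abs(impulse_response[i]))
    split = peak + int(sample_rate * direct_window_ms / 1000.0) + 1
    direct = sum(s * s for s in impulse_response[:split])
    reverberant = sum(s * s for s in impulse_response[split:])
    if reverberant == 0.0:
        return math.inf  # purely anechoic response
    return 10.0 * math.log10(direct / reverberant)


# Toy impulse responses: one direct spike followed by a flat reverberant tail.
far_source = [0.0, 1.0] + [0.0] * 200 + [0.1] * 100
near_source = [0.0, 1.0] + [0.0] * 200 + [0.05] * 100
drr_far = direct_to_reverberant_db(far_source, 48000)
drr_near = direct_to_reverberant_db(near_source, 48000)
```

The toy response with the weaker tail yields the higher ratio, which corresponds to a source judged closer.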

Research has shown that estimates of the distance to a sound source obtained through a single modality (either auditory or visual) differ greatly from one another. For example, listeners tend to largely underestimate those distances when the actual position of the sound source is more than about a meter away. When estimating the distance of a sound source, one would expect that combining vision and hearing would always improve the localization. Not so much. Indeed, Gardner (1968) found an effect (that he termed the proximity-image effect) whereby the closest plausible visible location is selected as the apparent sound source position, even though the actual source might be meters further away. It is worth mentioning that in Gardner’s study this effect was reported in an anechoic chamber, which prevented reverberation from bringing further information to the listener, but subsequent research (Mershon et al., 1980) found that the proximity-image effect works almost as efficiently in reverberant environments as in anechoic ones, whereas another study (Zahorik, 2001) concluded that the effect is not as definite as previously thought and that differences in methodology did not allow scientific conclusions to be drawn from the comparison between the studies. This last study also noted that throughout the experiment, listeners seemed to have improved their localization


skills within the given environment, suggesting quite clearly that, as one would expect, the brain can learn from experience in order to perform its tasks more accurately.

The two “Machine” sections that are about to unfold are an attempt to outline two different ways of approaching the offline binaural conversion of three-dimensional audio content. The first section serves as an introduction and presents several concepts that will prove relevant in this technical endeavour, while the second section will actually present and compare both suggested models. The first model relies on channels while the second one relies on objects.

Because the research was based on the assumption that both models were to reach the same design goals specified in section 5.1, the purpose of the present section is to explore means of reaching the said design goals. However, the actual testing of the presented models is beyond the scope of this study and will potentially form the object of further work.

The history of sound recording has taught us that ever since the appearance of the first sound-related technologies in the 19th century, the driving force behind this industry's evolution through the decades has been the effort to improve the sense of immersion brought to the listener. In that perspective, a highly non-exhaustive list of seminal technical improvements would include the following: stereophonic sound, first patented by A.D. Blumlein in 1931 (see reference); the general improvement of the linearity of analog circuits throughout the 20th century (transducers included); Disney's Fantasia technology "Fantasound" in the 40's; then quadraphonic sound in the early 70's, which paved the way for Dolby® and DTS® 5.1 surround sound systems, largely made possible by the digital technology revolution that started in the early 70's and matured more profoundly in the 2000's. The 2000's saw the democratization of higher sampling rates and higher bit-depth resolution, permitting a more realistic experience of sound. The 2010's are now witnessing the "3D momentum" (with products such as Barco®'s Auro 3D and Dolby®'s Atmos): experts have studied the topic for years, yet audiences are only starting to get acquainted with these recent, booming technologies.

However, although the commercial success of the new generation of 3D sound products is increasing and a fair number of movie directors report being satisfied with this new creative aspect of moviemaking, the money and space that consumers need to invest to install surround (let alone 3D) sound systems in their homes seem to deter them from investing in immersive experiences. Are 3D sound technologies then reserved for movie theaters? Binaural technologies hold the potential to negate this idea, not only allowing consumers to enjoy 3D sound in the comfort of their homes, but also bringing it to the mobile experience.


Head-Related Transfer Functions (HRTFs) are a mathematical attempt to isolate, in the form of a filter, the transfer functions containing all of the previously seen cues necessary for the human brain to localize a sound source at a given point in space. They thus comprise both the ITD (encoded in the filter's phase spectrum) and the IID (the interaural intensity difference, encoded in the filter's overall power), as well as the ear's frequency response corresponding to the position from which the stimulus was played back relative to the listener or mannequin. The measurement is made by playing a known stimulus through a loudspeaker placed in a free field at a given azimuth (θ), elevation (φ) and distance (Cheng & Wakefield, 2001). The impulse response is generally captured by small microphones placed in the listener's ears.

HRTFs are oftentimes specified to be minimum-phase FIR filters. This characteristic becomes very useful notably in the case of HRTF interpolation, where an FIR filter can mimic the attributes of an HRTF and is reported to give perceptually acceptable results (Kulkarni et al., 1995). In the case of real-time processing, such practices are of paramount importance in order to be able to output immersive audio, and research in that field has become increasingly important in recent years.

It is well worth mentioning that the quality of the immersive experience depends directly on the quality of the HRTFs. Indeed, since each individual’s pinnae and body are unique, personalized HRTFs should ideally be used. However, although the measurement is quite fast and rather straightforward, not everybody has the opportunity to have their own HRTFs measured, as the procedure notably requires specific gear as well as a calibrated multichannel system. That is the reason why several experts in the field have studied the different physical factors influencing HRTF measurements, in order to gain the understanding needed to build “general” HRTF databases that would eventually allow listeners to choose the HRTFs that best suit them. Indeed, the listener does not need his/her own HRTFs in order to have good localization abilities, as some studies have shown that it is possible for humans to adapt to another way of localizing sounds.


The two current sound technologies are the channel-based model and the object-based one. Their purposes are similar and their outcomes in terms of immersion are relatively close. However, their perspectives are quite different from each other, and their suitability varies according to the application. Unsurprisingly, the channel-based model takes channels as its reference. In the recording and/or mixing process, each channel is attributed one signal, which is meant to be reproduced over a speaker placed at the same relative position as the one used when the recorder/mixer approved the content. Therefore, the use of standard speaker layouts has become widely accepted in order to provide a reference system for the audience to enjoy the content the way it was meant to be heard. The drawback of this channel-based technology is that mixes approved in one given speaker layout cannot translate to another one without up- or down-mixing, thus a priori forcing content providers to mix their material several times, adding to the production costs (although some mixing techniques can be used to overcome this limitation to a certain extent; they are not discussed in the present paper).

The object-based model does not rely on the same concept of channels, but rather handles sounds as objects. Each object is assigned a position coordinate on the X, Y and Z axes, which varies according to a timecode. The purpose of this system is to be able to automatically rescale the reproduced mix to the available system layout, thus allowing better flexibility. However, the reproduction of object-based spatial audio requires the use of decoders in order to render the sounds according to the current system setup. This characteristic of the object-based model can be problematic in some cases, as it raises the question of the absence of a true reference “master”.
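The notion of a sound object whose X, Y, Z position varies with a timecode can be sketched as a small data structure. The class name, the keyframe representation and the linear interpolation between keyframes are all assumptions made for illustration; they do not describe the format of any commercial object-based system.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class AudioObject:
    """A sound handled as an object: a name plus time-stamped positions."""
    name: str
    # Each keyframe is (time in seconds, (x, y, z)), sorted by time.
    keyframes: list = field(default_factory=list)

    def position_at(self, t):
        """Position at time t, linearly interpolated between keyframes."""
        times = [k[0] for k in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:
            return self.keyframes[0][1]    # before the first keyframe
        if i == len(times):
            return self.keyframes[-1][1]   # at or after the last keyframe
        (t0, p0), (t1, p1) = self.keyframes[i - 1], self.keyframes[i]
        w = (t - t0) / (t1 - t0)
        return tuple(a + w * (b - a) for a, b in zip(p0, p1))


fly = AudioObject("fly", [(0.0, (0.0, 0.0, 0.0)), (1.0, (2.0, 0.0, 1.0))])
halfway = fly.position_at(0.5)
```

A decoder would query such positions at each time step and render them on whatever layout is available, which is where the flexibility of the object-based model comes from.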


The most common way to render spatial audio over headphones using HRTFs relies on convolution. Let us process a monaural signal x[i] so that a listener could localize it at a given azimuth θ and elevation φ. The processing yields y_l[i] and y_r[i], which are to be played back simultaneously on a pair of headphones, by the left and right membranes respectively. The HRTF database used in this research is the ARI database [6], which contains 256-sample-long HRTFs. Since HRTFs are often described as minimum-phase FIR filters, let d_{min,l,θ,φ} and d_{min,r,θ,φ} be the minimum-phase impulse responses, of length M, measured for the left and right ears from azimuth θ and elevation φ:

$$y_l[i] = \sum_{j=0}^{M-1} d_{\min,l,\theta,\varphi}[j]\, x[i-j]$$

$$y_r[i] = \sum_{j=0}^{M-1} d_{\min,r,\theta,\varphi}[j]\, x[i-j]$$
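The two sums above can be transcribed almost literally into code. The sketch below is only an illustration: the function names are ours and the two short impulse responses are toy values, not HRTFs taken from the ARI database.

```python
def fir_filter(x, d):
    """y[i] = sum over j of d[j] * x[i - j], for j = 0 .. M-1."""
    m = len(d)
    y = []
    for i in range(len(x) + m - 1):
        acc = 0.0
        for j in range(m):
            if 0 <= i - j < len(x):   # samples outside x count as zero
                acc += d[j] * x[i - j]
        y.append(acc)
    return y


def binauralize(x, d_left, d_right):
    """Filter one monaural signal with a left and a right impulse response."""
    return fir_filter(x, d_left), fir_filter(x, d_right)


mono = [1.0, 2.0, 3.0]
toy_left = [1.0, 0.0, 0.0]    # toy response: unchanged signal
toy_right = [0.0, 0.0, 0.5]   # toy response: delayed by 2 samples, halved
y_left, y_right = binauralize(mono, toy_left, toy_right)
```

Even with these toy responses the right output is later and quieter than the left one, i.e. the pair already carries a crude ITD and level difference.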

Great… But can we explain what convolution is in plain words? Just as addition is the mathematical operation that combines two numbers into a third one, convolution is the operation that combines two signals (the input signal and the impulse response) into a third one (the output signal). Its symbol is the star *, which should not be confused with the multiplication symbol used in computer programs! Practically, the expression “x[n] * h[n] = y[n]” can be read as: “signal x[n] is convolved with impulse response h[n], resulting in output y[n].” At this point, it is already worth mentioning that convolution is commutative, i.e. x[n] * h[n] = h[n] * x[n] = y[n].

An impulse is a signal whose points are all zero except one. The delta function, written δ[n], is a normalized impulse, i.e. its only nonzero sample is situated at index zero and has a value of one. When the delta function enters a given linear system, the output is called the impulse response, h[n]. Any impulse can be expressed as a shifted and scaled version of the delta function; for instance, let us consider a signal d[n] composed of a sample that has a value of -2 at index n=4, and whose other samples are all zero. Signal d[n] is thus a delta function shifted to the right by 4 samples and multiplied by -2. Therefore, d[n] = -2δ[n-4]. We say that the impulse response defines a system under convolution because, once it is known, we know how any signal will be affected when passed through the system. Actually, the impulse response is the system. Convolution being a fundamental building block of digital signal processing, it is worth noting that the term used to refer to the impulse response of a system varies according to the application: it is called a point-spread function in the field of image processing, or a kernel if the system considered is a filter. As previously mentioned, our HRTFs are considered to be filters applied to their input signals; kernel is therefore the right term to use in our case.

In most practical cases, the input signals of a convolution are several thousand samples long, while the impulse responses are usually much shorter. In our case, the input files are going to be the audio signals, while our kernels will be the HRTFs, which are, as mentioned, 256 samples long. The output files will be the same audio signals, but spatialized so that the brain is able to localize them at the intended spot in space. The number of samples contained in those output files will prove to be of great importance in the models presented in further sections. Fortunately, the formula used to calculate this number is very simple; the number of samples in the


output files equals the number of samples in the input file, plus the number of samples

contained in the kernel, minus one.
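As a quick check of this rule, let L_x, L_h and L_y denote the lengths of the input, the kernel and the output (symbols introduced here for illustration only). The first numerical case uses the 9-point input and 4-point impulse response of the example in this chapter; the second assumes, purely for illustration, a 100 ms block sampled at 44.1 kHz (4410 samples) filtered by a 256-sample HRTF.

```latex
L_y = L_x + L_h - 1,
\qquad 9 + 4 - 1 = 12,
\qquad 4410 + 256 - 1 = 4665
```

The 255 extra samples in the second case are the tail of the filter, which has to be accounted for whenever consecutive blocks are processed separately.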

While convolution can certainly be approached from several different perspectives, the short introduction presented here only has room for one. The point of view that we shall focus on is called the “input side algorithm” and will teach us how each sample of the input signal contributes to the making of the output signal. Although the input side algorithm does not provide the best mathematical understanding of convolution, it does offer some conceptual insight into the process, which is exactly what we are aiming for in this section.

Let us use a simple example of convolution for a 9-point input signal x[n] and a 4-point impulse response h[n].11 The input signal can be decomposed into discrete samples, each of which can be considered a shifted and scaled version of the delta function. Therefore, when looking at sample x[2] (“the sample situated at index 2 in x”), which has a value of two, we see that it can be expressed as 2δ[n-2] because it corresponds to a delta function multiplied by 2 and shifted two indexes to the right. After passing through the system, this component of x becomes 2h[n-2]. We can visually verify this concept in the second box of figure XXX, where the little diamonds serve as “place holders” in each box and are just added zeros, while the squares represent the actual contributions from each point of the input signal x[n].

Very briefly, the input side algorithm works as follows (see Figures 10 and 11): once the vectors are placed into their respective arrays (x[] for the input file, h[] for the impulse response) and the usual programming practices are taken care of in the script (notably zeroing the output array y[], because it serves as an accumulator and therefore needs to be reinitialized before each execution), two for loops are started. The first loop goes through every single index of x[] to look individually at all of the input signal’s samples. For each of them (each corresponding to a shifted and scaled delta function), a second, inner loop calculates a shifted and scaled version of the impulse response contained in h[]. Each result is then added to the output array y[].
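The two nested loops just described can be sketched as follows. This is a minimal illustration of the input side algorithm, not the script used in this research; the function name and the test signals are ours.

```python
def convolve_input_side(x, h):
    """Input side algorithm: each input sample adds a shifted, scaled copy of h."""
    y = [0.0] * (len(x) + len(h) - 1)   # zero the accumulator first
    for i in range(len(x)):             # outer loop: every input sample
        for j in range(len(h)):         # inner loop: every kernel sample
            y[i + j] += x[i] * h[j]     # contribution of x[i], shifted by i
    return y


# A single sample of value 2 at index 2 yields 2 * h shifted by two samples.
single_sample = convolve_input_side([0.0, 0.0, 2.0], [1.0, -1.0])

# A 9-point input and a 4-point kernel, as in the example of this section.
nine_points = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]
four_points = [1.0, 0.5, 0.25, 0.125]
output = convolve_input_side(nine_points, four_points)
```

Swapping the two arguments gives the same output, which is the commutativity mentioned earlier.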

11 The schemes used in this example were taken from the excellent The Scientist and Engineer’s Guide

to Signal Processing written by Steven W. Smith (1997)

Figure 10. Representation of the convolution between signal x[n] and impulse response

h[n] yielding signal y[n]. Taken from "Digital Signal Processing - A Practical Guide for

Engineers and Scientists" written by Steven Smith.


Figure 11. Representation of the input-side algorithm. Taken from "Digital Signal Processing - A

Practical Guide for Engineers and Scientists" written by Steven Smith.


This section gives a very short introduction to the very large topic of digital filters. The goal is to convey some insight into the way our filter kernels (HRTFs) are going to interact with our input signal.

Every filter is characterized by three main attributes: the impulse response (i.e. its filter kernel), the step response and the frequency response. It is worth noting that all three attributes represent the same information, only described from different perspectives: the step response and the frequency response can both be derived from the impulse response. Indeed, integration12 of the impulse response yields the step response, whereas taking the DFT of the impulse response (by means of the FFT algorithm) yields the filter’s frequency response.
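Both conversions can be sketched in a few lines. The example below uses a direct DFT rather than an FFT for the sake of readability (an FFT returns the same values, only faster); the function names and the 4-point moving-average kernel are illustrative assumptions.

```python
import cmath
from itertools import accumulate


def step_response(h):
    """Running sum of the impulse response."""
    return list(accumulate(h))


def frequency_response(h, n_points=None):
    """Direct DFT of the impulse response, zero-padded to n_points."""
    n = max(n_points or len(h), len(h))
    padded = list(h) + [0.0] * (n - len(h))
    return [
        sum(padded[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


kernel = [0.25, 0.25, 0.25, 0.25]   # hypothetical moving-average kernel
step = step_response(kernel)
magnitudes = [abs(v) for v in frequency_response(kernel, n_points=8)]
```

For this kernel the step response climbs from 0.25 to 1 in four samples, and the magnitude falls from 1 at DC to zero at half the sampling rate, i.e. the same low-pass behaviour seen from two perspectives.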

The realm of filtering is one of decisions; indeed, there is no such thing as a perfect filter. The filter’s characteristics are therefore to be adapted according to its function. We saw that in the central auditory nervous system the stimuli from the cochlear nerve are split into two distinct pathways, handling time and frequency information respectively, because such a system is necessary to preserve the features of the stimuli relevant to each stream. In much the same way, a filter cannot perform well in both the time and frequency domains at the same time. Therefore, the step response can be focused on if the application requires high time-domain resolution, while the frequency response can be improved if the filter is to be


12 Or, to be mathematically correct, « doing the running sum ». Indeed, integration is an operation that applies to continuous signals only, whereas the running sum is the appropriate term when dealing with discrete signals.


used in an application demanding a good frequency resolution. Let us take a closer

look at their respective parameters.

First of all, what is the step response? And before that, what is a step function? In order to answer this second question, it is useful to note how we humans interpret signals. Our brain is capable of dividing stimuli into regions of similar characteristics (such as noise, then high amplitudes, then low amplitudes, etc.) by identifying the turning points between those regions, i.e. the points that separate them.

Figure 12. Representations of the good and poor characteristics of a filter designed for a time-

domain application. Taken from "Digital Signal Processing - A Practical Guide for Engineers and

Scientists" written by Steven Smith.

That is exactly what step functions are: turning points between zones of similar characteristics. The step response (which can also be found by taking the running sum of the impulse response) results from feeding a step function into a given system, i.e. in our case, a filter. Basically, the step response shows us how the step function was affected by the filter. The step response is described by three main parameters: risetime, overshoot and linearity. In order to design a filter for use in the time domain, the risetime needs to be shorter than the spacing of the events, in order to provide good resolution. The step response should not overshoot, because overshoot distorts the amplitude of samples in the signal. Lastly, the filter has linear phase when the upper half of the step response is a point reflection of its lower half (see Figure 12).


Figure 13. Representations of the good and poor characteristics of a filter designed for a

frequency-domain application. Taken from "Digital Signal Processing - A Practical Guide for

Engineers and Scientists" written by Steven Smith.

Filters used for applications in the frequency domain have three main parameters: roll-off, passband ripple and stopband attenuation. A fast roll-off allows a narrow transition band13; there should be no passband ripple, so that the signal we want to keep is not affected by the filtering; and the stopband attenuation should be maximal.

Now that we have a clearer view of digital filters, we can look at the two possible ways to filter an input signal: convolution and another process called recursion. Convolution produces FIR (Finite Impulse Response) filters, while recursion produces IIR (Infinite Impulse Response) filters. In theory, FIR filters are fantastic in our case because they can be designed not to distort the phase of our input signal, which will prove to be of paramount importance in further


13 A transition band is the band situated between the pass-band and the stop-band, i.e. the band it takes to go from -3 dB of the pass-band to the stop-band. It is worth noting that while this definition holds in the analog realm, transition bands in the digital realm were never really standardized and are often specified as a percentage (99%, 70.7%, which equals -3 dB, 50%, etc.).
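The two ways of filtering named in this section, convolution (FIR) and recursion (IIR), can be contrasted by feeding an impulse to one filter of each kind. Both filters below are generic textbook examples chosen by us for illustration (a 4-point moving average and a single-pole low-pass); neither is part of the proposed models.

```python
def fir_moving_average(x, taps=4):
    """FIR filter by convolution: the output depends on input samples only."""
    h = [1.0 / taps] * taps
    y = [0.0] * (len(x) + taps - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def iir_single_pole(x, a=0.5):
    """IIR filter by recursion: y[n] = (1 - a) * x[n] + a * y[n - 1]."""
    y, previous = [], 0.0
    for xn in x:
        previous = (1.0 - a) * xn + a * previous
        y.append(previous)
    return y


impulse = [1.0] + [0.0] * 9
fir_out = fir_moving_average(impulse)   # zero after 4 samples: finite response
iir_out = iir_single_pole(impulse)      # decays but never reaches zero exactly
```

The FIR output is exactly its kernel followed by zeros, whereas the recursive output keeps decaying for as long as we care to compute it.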


The goal of the conversion algorithm is to output stereo, spatialized files that are as immersive as possible for the listener. Although the script runs offline and computing time is not of paramount importance, the algorithm must be written efficiently enough to allow for a possible real-time adaptation in the future. Human psychoacoustic considerations being of extreme importance for the immersive quality of the output files, they must be placed at the center of the design goals. Practically, those considerations translate into the following propositions:

(1) The converter must provide as many virtual sound source positions as possible. The MAA (Minimum Audible Angle) is the basic metric of the listener's relative localization ability, and is thought of as the smallest change in azimuth or elevation of a sound source detectable by humans (Letowski & Letowski, 2012). Therefore, the MAA is a good indicator of the resolution of the auditory localization system. On the azimuthal plane, humans showed the ability to discriminate changes of only 1° or 2° in the frontal position when wide-band stimuli and low-frequency tones were played (Grothe et al., 2010). Those values were reported to increase to 8-10° at 90° and decrease again to 6-7° at the rear (Letowski & Letowski, 2012). The MAA reported on the elevation plane is about 3-9° in the frontal hemisphere, and almost twice as large in the rear hemisphere (at 60° in elevation) (Letowski & Letowski, 2012). However, it is well worth mentioning that the MAA does not quantify absolute localization judgements, but only relative ones. The average error reported in absolute localization for a broadband source is much larger: 5° for the frontal and about 20° for the lateral position (Hofman & Van Opstal, 1998). This valuable information is useful in assessing the suitability of the ARI HRTF database used in the present research. The HRTFs were measured in incremental steps of 2.5° in the azimuthal range of ±45° and of 5° outside this range. The elevation was measured in increments of 5°. From these facts, we can conclude that the resolution of the ARI database is good enough for HRTF interpolation not to be considered in the present research.
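Without interpolation, a requested direction simply has to be snapped to the closest measured one. The sketch below assumes a regular grid built from the step sizes quoted above (2.5° in azimuth within ±45°, 5° elsewhere, 5° in elevation); the function name is hypothetical and the actual layout and elevation range of the ARI files are not modeled.

```python
def nearest_grid_direction(azimuth_deg, elevation_deg):
    """Snap a direction to the closest point of the assumed measurement grid."""
    azimuth = (azimuth_deg + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    step = 2.5 if abs(azimuth) <= 45.0 else 5.0       # finer grid in front
    snapped_azimuth = round(azimuth / step) * step
    snapped_elevation = round(elevation_deg / 5.0) * 5.0
    return snapped_azimuth, snapped_elevation


frontal = nearest_grid_direction(21.0, 12.0)
lateral = nearest_grid_direction(93.0, 0.0)
rear = nearest_grid_direction(200.0, 0.0)
```

Under these assumptions the azimuth error never exceeds 1.25° in the frontal region and 2.5° elsewhere, which is of the same order as the MAA values quoted above.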

(2) The converter's time reference must be short enough to provide good resolution in the sound sources' movement. Humans perceive sound motion through a series of cues, the main ones being the radial and angular velocities (Letowski & Letowski, 2012). The radial velocity is the velocity at which sound sources move towards or away from the listener, directly affecting the sound intensity as well as inducing Doppler shifts in sound frequency. The angular velocity, on the other hand, is the velocity at which sounds rotate around the listener and is perceived through monaural and binaural localization cues. Although the radial velocity has little impact in the present research because the ARI HRTF database does not include ear-source distance variations, the angular velocity turns out to be very useful information to work with. The MAMA (Minimum Audible Movement Angle), which is the primary metric used in reporting perceived sound source motion, is defined as the smallest angular distance the sound source has to travel for its direction of motion to be detected. It can therefore be thought of as the detection threshold for movement. The MAMA is smallest in front of the listener and increases as the sound source moves away to the sides of the head. Indeed, a minimum duration of


150–200 ms in the 0°–60° range of observation angles was reported, and the durations increased by ~25%–30% at larger angles for sound sources moving at low velocities. Since 150 ms thus seems to be the shortest time a human needs to perceive sound source motion, the time reference for the models was chosen to be below that value, namely 100 ms, so that excellent movement resolution could be obtained.
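In practice, a 100 ms time reference amounts to cutting each signal into consecutive blocks of a fixed number of samples. A minimal sketch, in which the function name and the sampling rate used in the example are assumptions:

```python
def split_into_frames(signal, sample_rate, frame_ms=100):
    """Cut a signal into consecutive frames of frame_ms milliseconds.

    The last frame is shorter when the signal length is not a multiple
    of the frame length.
    """
    frame_length = int(sample_rate * frame_ms / 1000)
    return [signal[s:s + frame_length] for s in range(0, len(signal), frame_length)]


# 250 samples at an assumed rate of 1000 Hz give frames of 100, 100 and 50.
frames = split_into_frames(list(range(250)), sample_rate=1000)
frame_lengths = [len(f) for f in frames]
```

Each frame can then be assigned its own source position, so that a moving sound is updated ten times per second.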

(3) The converter must induce minimal phase artifacts.

5.2. Overview of the Channel Based Model

The converter is fed audio files in the form of channels. The simplest form of

algorithm would be to convolve those channels with the HRTFs corresponding to the

 physical positions of the speakers in the required layout, effectively creating static,

"virtual speakers". However, the sense of immersion induced by this technique is

limited because of the very few sound source positions available. In order to meet the

first requirement of our design goals, two specific categories of sounds are to be

distinguished: static sounds and sounds evolving in space. The implementation of the

following process is suggested in order for the program to distinguish between both

categories. Nonetheless, it is worth noting that the presentation of such a process does

not aim at providing an exhaustive and precise implementation strategy for the

channel-based algorithm, but rather is used as a means to recognize the type of

processing required for the binaural auralization of 3D channel-based content.

First of all, a timecode is applied to the audio content in order to provide it with a time reference system. As justified in section 5.1, the time reference is a tenth of a

second. The signals contained in every channel are then analyzed in the frequency domain in order to determine the energy contained in each of their

sub-bands. However, because the signals that are dealt with in the present application

are usually non-stationary, the proposed method for such a process requires the use of wavelet transforms as opposed to the traditional Fourier transform, which is not

suitable in this case because it does not reveal the temporal structure of a non-stationary

signal with enough precision. Through the scaling and the time shifting of the mother

wavelet function, the input signal can effectively be analyzed to reveal its spectral

content as intended.
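
The scaling and time shifting described above can be illustrated with a minimal Python sketch. It assumes the Haar wavelet purely for brevity (the text does not name a mother wavelet), and the names haar_step, haar_decompose and energy are illustrative only. Each level halves the time resolution and isolates a coarser band, while the position of each coefficient still tells where in time the energy occurred.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail). A trailing odd sample is ignored.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail

def haar_decompose(signal, levels):
    """Return the detail coefficients per level (finest first) and the final approximation."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return details, approx

def energy(coefficients):
    """Sum of squared coefficients, i.e. the energy held by one band."""
    return sum(c * c for c in coefficients)

# Toy non-stationary signal: silence, then a short fast-alternating burst.
signal = [0.0] * 16
signal[8:12] = [1.0, -1.0, 1.0, -1.0]

details, approx = haar_decompose(signal, 3)
for level, band in enumerate(details, start=1):
    print("level", level, "energy", round(energy(band), 6))
```

In this toy case all of the energy lands in the finest band, and only in the two coefficients covering samples 8 to 11, which is the temporal information a single Fourier transform of the whole signal would not show.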

The next step in the process is a comparison of all of

the channels' sub-bands' RMS values, computed over windows of 100 ms. The purpose of

such a system is to monitor the spectral activity of the channels in order to draw

conclusions about the spatial evolution of sounds from one channel to the other. Each

channel's sub-bands are compared to the corresponding sub-bands of the channels

played back on speakers whose physical positions are adjacent to the analyzed channel, 100 ms later. Let us exemplify this idea using Barco®'s Auro 11.1 speaker layout (L, C,

R, Ls, Rs, HL, HC, HR, HLs, HRs, VoG and LFE), with the analysis of the Right

channel's sub-band centered on 1 kHz. This Right channel's sub-band's RMS value at

time 00:00:00:10 will be compared to the RMS values of the same sub-bands (i.e. centered on 1 kHz) of the C, HC, HR, HRs and Rs channels at time 00:00:00:20. For

simplicity's sake, let us continue expressing the present idea with only one of R's adjacent channels, namely C. When comparing the RMS value of R's sub-band

centered around 1 kHz with C's, three different outcomes can be expected: the RMS value of R's 1 kHz sub-band at 00:00:00:10 can be either greater than, equal to or smaller

than C's at 00:00:00:20. Depending on the outcome, logical conclusions can be

drawn from these observations: if R's sub-band is quieter than C's, chances are that whatever sound contains energy at 1 kHz is evolving from the right to the

center. Conversely, if R's sub-band is louder than C's, the sound is evolving from the

center speaker to the right one. If the energy contained in R's sub-band is the same as

C's, the sound scene is likely to be static at that time. However, this form of spectral tracking is far from infallible: for example, it could not track sounds

whose spectral content evolves along with their position in space.
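
The three-outcome comparison between R and C can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the returned labels and the tolerance_db parameter are hypothetical. The tolerance is added because two measured RMS values are almost never exactly equal. The sub-band windows are assumed to have been extracted already.

```python
import math

def rms(samples):
    """Root-mean-square value of one analysis window."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def classify_motion(r_window, c_next_window, tolerance_db=1.0):
    """Compare R's sub-band in one window with C's sub-band in the next window.

    tolerance_db is an assumed margin within which both levels count as equal.
    """
    rms_r = rms(r_window)
    rms_c = rms(c_next_window)
    if rms_r == 0.0 or rms_c == 0.0:
        # Avoid log10(0): decide from whichever side carries energy.
        if rms_r == rms_c:
            return "static"
        return "right to center" if rms_r < rms_c else "center to right"
    difference_db = 20.0 * math.log10(rms_r / rms_c)
    if abs(difference_db) <= tolerance_db:
        return "static"
    # R quieter than C: the sound is heading toward the center, and vice versa.
    return "right to center" if difference_db < 0.0 else "center to right"

def tone(amplitude, n=4410, fs=44100, f=1000.0):
    """Test helper: one 100 ms window of a 1 kHz sine at 44.1 kHz."""
    return [amplitude * math.sin(2.0 * math.pi * f * i / fs) for i in range(n)]

print(classify_motion(tone(0.2), tone(0.8)))
print(classify_motion(tone(0.8), tone(0.2)))
print(classify_motion(tone(0.5), tone(0.5)))
```

A complete analysis would repeat this comparison for every sub-band and for every adjacent channel (C, HC, HR, HRs and Rs in the example above).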

Channels are then separated into sub-bands using band-pass FIR filters, and each channel’s sub-bands are convolved with HRTFs corresponding to the positions retrieved from the spectral analysis explained in the

previous paragraph; static sound sources are convolved with the HRTFs corresponding

to the virtual speakers’ positions whereas sound sources evolving in space are

convolved with HRTFs corresponding to intermediate positions between those virtual

speakers. However, this method of binaural spatialization yields phase distortion of

the channels’ signals, which goes against proposition (3) of the design goals.

This brief explanation allows one to start realizing the complex development

procedures required in order to binaurally translate 3D channel-based audio contents

while aiming to reach the design goal expressed in 5.1 stating "The converter must

provide as many virtual sound source positions as possible". Indeed, in order to do so, a system of detection of wave patterns coupled with minimum-phase FIR filters (whose

qualitative performances can be very high if properly designed but would subsequently

show poor computational efficiency) should allow the script to perfectly "crop" every

single element of the audio content in order to individually convolve it with the

HRTFs corresponding to its position. Although such a procedure would currently be

impossible to achieve with perfect results, it would effectively turn channel-based contents into object-based ones.


In the object-based model, sounds are considered to be objects, each with their own set of spatial

coordinates in regard to the time-reference. Since the HRTF measurements from the

ARI database only include the direction variation of the incoming signal and not the

ear-source distance (like most of the available HRTF databases), only two axes are

relevant in our position coordinates system: azimuth (θ) and elevation (φ). The

azimuth parameter will have increments of 2.5° from -45° to +45° and of 5° for the rest

of the sphere. The elevation parameter will have increments of 5° throughout. As

explained in 5.1 the time reference is the tenth of a second in order to provide good

locational accuracy when the need arises to process objects that quickly evolve in space.


The spatial coordinates and the time-reference (timecode) for each object are stored in a .txt file. The program reads each object’s own .txt file,

looks up the object’s coordinates at each timecode, associates those coordinates with the

corresponding HRTF, and convolves this HRTF with the object. For best efficiency, a

function performing the coordinates/HRTFs association can easily be built into the program so it does not have to be recreated during each execution.
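
As an illustration of the coordinates/HRTFs association, here is a small Python sketch. The layout of the .txt file is not specified in this work, so the format used here (one line per 100 ms window holding a window index, an azimuth and an elevation) is purely hypothetical, as are the function names. The sketch snaps each coordinate to the measurement grid described above: 2.5° azimuth steps between -45° and +45°, 5° steps elsewhere, and 5° elevation steps.

```python
import math

def snap(value, step):
    """Round value to the nearest multiple of step (ties round up)."""
    return math.floor(value / step + 0.5) * step

def snap_direction(azimuth, elevation):
    """Map an arbitrary direction onto the nearest grid position."""
    az = (azimuth + 180.0) % 360.0 - 180.0           # wrap into [-180, 180)
    az_step = 2.5 if -45.0 <= az <= 45.0 else 5.0    # finer grid in front
    return snap(az, az_step), snap(elevation, 5.0)

def parse_trajectory(text):
    """Parse 'window_index azimuth elevation' lines (hypothetical format)."""
    trajectory = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        index, azimuth, elevation = line.split()
        trajectory[int(index)] = snap_direction(float(azimuth), float(elevation))
    return trajectory

example = """
# window azimuth elevation
0 0.0 0.0
1 3.9 1.0
2 47.0 12.6
3 190.0 -4.0
"""

trajectory = parse_trajectory(example)
for window, direction in sorted(trajectory.items()):
    print(window, direction)
```

The snapped (azimuth, elevation) pair can then serve as the key of a table of HRTF pairs loaded once at start-up, which is one way of avoiding rebuilding the association at every execution. The sketch does not clamp elevations to the range actually measured in the database.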

During each 100 ms window, a number of samples from the object are convolved with the HRTF corresponding to their coordinates. The number of samples

depends directly on the sampling rate of the .wav object. For example, an object whose sampling rate is 44100 Hz is segmented into pieces of 4410 samples. Those

4410 samples are then convolved with the HRTF corresponding to their coordinates at

that point in time. Since all of ARI's HRTFs are 256 samples long, according to basic convolution rules the number of samples resulting from this single operation

will thus amount to 4410 + (256-1) = 4665 samples (which will be referred to as

"window convolution output samples"). When these window convolution output

samples are placed one after the other into a vector, a system of overlap ensures that the total number of samples grows only by one final 255-sample tail rather than by 255 samples per window, and

that the valuable information contained in the tails of the

window convolution outputs is not discarded. In practice, still in the case of an input signal originally sampled at 44,100 Hz, the first 4665 samples resulting from the first convolution are laid into a vector, but the second "load" of 4665 samples starts at index 4411 (counting from 1, i.e. right after the first 4410 samples),

so that its first 255 samples are added to the 255-sample tail of the first window convolution

output. This operation goes on until all of the object’s samples are processed.
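
The windowed convolution with overlapping tails can be sketched in Python as below. This is a minimal illustration, not the program itself: the function names are hypothetical, the impulse responses (the time-domain form of the HRTFs) are passed as a plain list with one entry per window, and a direct-form convolution is used for readability rather than speed. Indices are zero-based here, so window k starts at sample k × 4410.

```python
def convolve(x, h):
    """Direct-form linear convolution; output length is len(x) + len(h) - 1."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add(signal, hrirs, window):
    """Convolve each window with its own impulse response and sum the tails.

    hrirs holds one impulse response per window, all of the same length.
    """
    taps = len(hrirs[0])
    out = [0.0] * (len(signal) + taps - 1)
    for k, start in enumerate(range(0, len(signal), window)):
        block = convolve(signal[start:start + window], hrirs[k])
        for n, value in enumerate(block):
            out[start + n] += value   # the tail overlaps the next window
    return out

# Sizes quoted in the text for a 44.1 kHz object and 256-sample responses.
WINDOW = 44100 // 10            # samples per 100 ms window
TAPS = 256
BLOCK_OUT = WINDOW + TAPS - 1   # window convolution output samples
OVERLAP = BLOCK_OUT - WINDOW    # samples shared with the next window

# Small demonstration with toy sizes so it runs instantly.
x = [float(n) for n in range(1, 26)]
h = [1.0, 0.5, 0.25, 0.125]
y = overlap_add(x, [h, h, h], window=10)
print(len(y), BLOCK_OUT, OVERLAP)
```

When every window uses the same impulse response, the result equals a single convolution of the whole signal, which is a convenient sanity check. With a different response per window, each window is simply filtered for its own position.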

As explained earlier, HRTFs work in pairs. Therefore, for each object, the

aforementioned process must be carried out twice: once for each ear’s transfer function

at any given position.

Once the convolution process is finished for all of the objects, their vectors can

then all be padded with 0’s in order for them to have the same length.

After padding, two variables that will be referred to as “Left Ear Bucket” and “Right

Ear Bucket” are initialized. Those so-called buckets respectively sum all the values stored in the vectors processed with the “left ear HRTFs” (in the Left Ear Bucket)

and with the “right ear HRTFs” (in the Right Ear Bucket).
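
The padding and bucket step can be sketched in a few lines of Python. The function name and the data layout (one pair of left-ear and right-ear vectors per object) are assumptions made for the example. Summing into buckets that are as long as the longest vector is equivalent to zero-padding every vector first.

```python
def mix_into_buckets(objects):
    """Sum per-object (left, right) vectors into two ear buckets.

    objects is a list of (left_vector, right_vector) pairs of any lengths.
    """
    length = max(max(len(left), len(right)) for left, right in objects)
    left_bucket = [0.0] * length
    right_bucket = [0.0] * length
    for left, right in objects:
        for n, value in enumerate(left):
            left_bucket[n] += value
        for n, value in enumerate(right):
            right_bucket[n] += value
    return left_bucket, right_bucket

objects = [
    ([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]),
    ([1.0], [1.0, 1.0]),
]
left_bucket, right_bucket = mix_into_buckets(objects)
print(left_bucket, right_bucket)
```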

The last step concerns normalization. In this case, the EBU R128 standard was

chosen in order to normalize signals to an appropriate level and is implemented

through the use of C++ libraries made available.

Independently from these considerations, a good question to raise is whether the object-based model is able to reproduce diffuse-field audio.

Given that, by definition, one object can only hold one position at a given point in time, it would be impossible to reproduce a soundscape recorded

with a microphone array using objects alone. Such sounds belong to the channel-based domain and are commonly referred to as “beds”. They require the use of up- or down-mixing

in order to be adapted to the number of speakers available in the object-based

reproduction system used.


Although the perspectives of the channel- and object-based implementation

models presented in this paper are very different from each other, they actually

complement each other nicely. Indeed, as we have seen, the channel-based model is

well suited to sounds containing diffuse-field content, since such sounds surround the

listener and thus originate from several positions. However, this model does not allow the convenient binaural localization of sound sources evolving in space. Those types of

sounds are best handled by the object-based model, which makes it easy to associate

the individual sound sources’ positions with the corresponding HRTFs. The creation of a

hybrid algorithm combining the respective strengths of both presented models would be promising and hold much potential for the field of binaural conversion of 3D

contents.

As a side note, another conclusion can be reached: in order to obtain the best

possible output quality, one necessarily has to conceive a real-time algorithm. Indeed, as we have seen, good localization relies on the listener’s

ability to rotate his/her head in order for his/her brain to improve the auditory object’s

localization. Thanks to the various face-tracking systems available (webcams, optical, electromagnetic, etc.), the listener is able to rotate his/her head while the system

interprets those rotations and maintains the soundstage up front. However, for best

results the latency should be minimal and a system of HRTF interpolation should be

set up in order to improve the resolution of auditory objects’ movements.
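
A minimal Python sketch of these two ideas follows. It is a simplification under stated assumptions: head yaw is taken to be positive in the same direction as azimuth, only the horizontal angle is handled, and interpolation is a plain linear crossfade between the two neighbouring measured impulse responses on a 5° grid. More careful interpolation schemes exist, and all names here are illustrative.

```python
import math

def wrap(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def relative_azimuth(source_azimuth, head_yaw):
    """Source direction as seen from the rotated head."""
    return wrap(source_azimuth - head_yaw)

def interpolation_weights(azimuth, step=5.0):
    """Return ((lower_angle, weight), (upper_angle, weight)) for a grid of step degrees."""
    lower = math.floor(azimuth / step) * step
    upper_weight = (azimuth - lower) / step
    return (lower, 1.0 - upper_weight), (wrap(lower + step), upper_weight)

def interpolate_hrir(hrir_lower, hrir_upper, upper_weight):
    """Linear crossfade between two impulse responses of equal length."""
    return [(1.0 - upper_weight) * a + upper_weight * b
            for a, b in zip(hrir_lower, hrir_upper)]

# A source fixed at 30 degrees while the listener turns the head by 18 degrees.
azimuth = relative_azimuth(30.0, 18.0)
(lower, w_lower), (upper, w_upper) = interpolation_weights(azimuth)
print(azimuth, lower, upper, round(w_lower, 3), round(w_upper, 3))
```

In a real-time system the head yaw would be refreshed by the tracker at every processing block, so the interpolated response follows the head continuously instead of jumping between grid positions.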


This work served the objective of introducing three-dimensional hearing from three different perspectives, namely the body’s, the brain’s and the machine’s.

Although it is obvious that entire books could be written about every little section of

this paper, my intention was to provide my readers with a glimpse into several very

different concepts that are part of fields of study that usually do not blend in with each

other. However, it is my firm belief that that is where the future resides: nowadays the

complexity of knowledge can be such that researchers become extremely

specialized within a single area of their field, therefore sometimes losing the general

view. I believe that taking an interest in several fields of study is a good way to stay

inspired and to keep a sense of the whole, because in the end everything is

interrelated. How fascinating!

Let us review what we have gone through. Starting off in Chapter 1 with some

physics fundamentals required for the proper understanding of the second one, we

notably went over the defining concepts of a sound wave and its propagation in air

before moving on to discussing the impedance of a medium. Fourier analysis was then briefly introduced before a word on the mathematical definition of linearity.

Chapter 2 started directing us more toward the subject of this paper. Indeed, the Body section discussed the outer ear, including its composition and roles, with a brief focus

on the important role of the pinna. We then discussed the middle ear and its ossicles, paramount in the impedance transformation process, which is a concept that was

introduced in the first chapter. We also went over the action of the muscles of the middle ear, which reduce the intensity of the sound wave entering the oval

window. Then, the inner ear and the cochlea were introduced. We went over the inner

ear’s composition while focusing on the cochlea’s important role. We saw its anatomy and

mechanics, and discussed its organ of Corti containing the hair cells and stereocilia

responsible for the electrochemical translation of mechanical phenomena. Still within

the cochlea, we saw its physiological functioning with a view on Von Békésy’s work.

The last section of this second chapter discussed the fluids contained in the cochlea’s

scalae, the perilymph and endolymph, whose very special composition pointed at a

very special role.

Chapter 3 started where Chapter 2 left us: with electrochemical impulses.

Welcome to the brain’s domain! We learnt about the different pathways it uses to preserve the precious information, which needs to be refined through several different

stages before finally being summarized into a so-called auditory object. We hear! A

short description of the ascending pathways is provided pp. 17 and 18 with a very useful figure that helps to visualize our introductory voyage through the central

auditory nervous system. A second section within the third chapter allowed us to gain some insight into the way we, humans, localize sound sources on the

horizontal plane (mainly relying on ITDs and ILDs) and on the vertical plane (mainly relying on monaural spectral cues). We also briefly discussed the cues involved in our ability to estimate the

distance from a sound source.

Chapter 4 initiates a shift in perspective on the three-dimensional hearing topic,

and aims at introducing several concepts whose understanding will prove useful in the discussions provided in Chapter 5. We introduced the concept of the Machine,

which is an attempt to present and compare the implementation strategies of two

algorithms (channel- and object-based) for the offline binaural conversion of 3D audio

contents. We then went over a historical introduction on why binaural conversion is a solution to a situation we are facing in today’s world, namely the fact that multi-channel

sound solutions seem to be meant to remain in movie theaters, as the cost

and space required for consumers to acquire such systems are prohibitive in many cases. We subsequently continued on Chapter 4’s main purpose and went over the concepts of HRTFs, compared channels and objects, then introduced convolution and digital


Chapter 5 brings us to the heart of the matter, with three propositions that

constitute the algorithms’ common design goals, relying on recent research in

psychoacoustics. Section 5.2 overviews the channel-based model, while section 5.3

handles the object-based one, before concluding in section 5.4 on which is best for

what application.

Chapter 6 is the current chapter. Chapter 7 offers two appendices. The first one

is a « bonus »: it’s the MATLAB script for the channel-based « beds » presented

earlier, which I programmed with the help of Thomas Pairon. This is meant to show what the Saving Private Ryan 5.1 audio files (burned on the attached CD) went


Chapter 8 presents the references I used to write this paper. You will find

references of the books, the websites and the cited work.

Thank you for your attention!

clear all;
close all;

% Read the 256-sample head-related impulse responses (16-bit integers,
% big-endian) for the left and right ear of each virtual speaker.
LHRTFLeft = fopen('L0e030a.dat','r','ieee-be'); % L CHANNEL
LDataLeft = fread(LHRTFLeft,256,'short');
fclose(LHRTFLeft);
LHRTFRight = fopen('L0e210a.dat','r','ieee-be');
LDataRight = fread(LHRTFRight,256,'short');
fclose(LHRTFRight);

CHRTFLeft = fopen('L0e000a.dat','r','ieee-be'); % C CHANNEL
CDataLeft = fread(CHRTFLeft,256,'short');
fclose(CHRTFLeft);
CHRTFRight = fopen('L0e180a.dat','r','ieee-be');
CDataRight = fread(CHRTFRight,256,'short');
fclose(CHRTFRight);

RHRTFLeft = fopen('L0e330a.dat','r','ieee-be'); % R CHANNEL
RDataLeft = fread(RHRTFLeft,256,'short');
fclose(RHRTFLeft);
RHRTFRight = fopen('L0e150a.dat','r','ieee-be');
RDataRight = fread(RHRTFRight,256,'short');
fclose(RHRTFRight);

LsHRTFLeft = fopen('L0e250a.dat','r','ieee-be'); % Ls CHANNEL
LsDataLeft = fread(LsHRTFLeft,256,'short');
fclose(LsHRTFLeft);
LsHRTFRight = fopen('L0e070a.dat','r','ieee-be');
LsDataRight = fread(LsHRTFRight,256,'short');
fclose(LsHRTFRight);

RsHRTFLeft = fopen('L0e110a.dat','r','ieee-be'); % Rs CHANNEL
RsDataLeft = fread(RsHRTFLeft,256,'short');
fclose(RsHRTFLeft);
RsHRTFRight = fopen('L0e290a.dat','r','ieee-be');
RsDataRight = fread(RsHRTFRight,256,'short');
fclose(RsHRTFRight);



% Read the 5.1 stems. audioread replaces wavread, which recent MATLAB
% releases no longer provide. Each stem feeds both ears, so the right-ear
% copy reuses the samples already loaded instead of reading the file twice.
LLeft = audioread('SPR-L.wav');
LRight = LLeft;
CLeft = audioread('SPR-C.wav');
CRight = CLeft;
RLeft = audioread('SPR-R.wav');
RRight = RLeft;
LsLeft = audioread('SPR-Ls.wav');
LsRight = LsLeft;
RsLeft = audioread('SPR-Rs.wav');
RsRight = RsLeft;
LFE = audioread('SPR-LFE.wav');


% Convolve each stem with the impulse responses read by fread
% (the *Data* vectors, not the file identifiers returned by fopen).
LConvHRTFLeft = conv(LLeft,LDataLeft); % L CHANNEL
LConvHRTFRight = conv(LRight,LDataRight);

CConvHRTFLeft = conv(CLeft,CDataLeft); % C CHANNEL
CConvHRTFRight = conv(CRight,CDataRight);

RConvHRTFLeft = conv(RLeft,RDataLeft); % R CHANNEL
RConvHRTFRight = conv(RRight,RDataRight);

LsConvHRTFLeft = conv(LsLeft,LsDataLeft); % Ls CHANNEL
LsConvHRTFRight = conv(LsRight,LsDataRight);

RsConvHRTFLeft = conv(RsLeft,RsDataLeft); % Rs CHANNEL
RsConvHRTFRight = conv(RsRight,RsDataRight);


% Zero-pad every convolved stem to the longest length so they can be summed.
TotalLength = [length(LConvHRTFLeft) length(LConvHRTFRight) ...
    length(CConvHRTFLeft) length(CConvHRTFRight) ...
    length(RConvHRTFLeft) length(RConvHRTFRight) ...
    length(LsConvHRTFLeft) length(LsConvHRTFRight) ...
    length(RsConvHRTFLeft) length(RsConvHRTFRight) length(LFE)];

L = zeros(max(TotalLength),2);
for i = 1:length(LConvHRTFLeft)
    L(i,1) = LConvHRTFLeft(i);
    L(i,2) = LConvHRTFRight(i);
end

C = zeros(max(TotalLength),2);
for i = 1:length(CConvHRTFLeft)
    C(i,1) = CConvHRTFLeft(i);
    C(i,2) = CConvHRTFRight(i);
end

R = zeros(max(TotalLength),2);
for i = 1:length(RConvHRTFLeft)
    R(i,1) = RConvHRTFLeft(i);
    R(i,2) = RConvHRTFRight(i);
end

Ls = zeros(max(TotalLength),2);
for i = 1:length(LsConvHRTFLeft)
    Ls(i,1) = LsConvHRTFLeft(i);
    Ls(i,2) = LsConvHRTFRight(i);
end

Rs = zeros(max(TotalLength),2);
for i = 1:length(RsConvHRTFLeft)
    Rs(i,1) = RsConvHRTFLeft(i);
    Rs(i,2) = RsConvHRTFRight(i);
end

% Pad the LFE stem without discarding its samples.
LFEPadded = zeros(max(TotalLength),1);
LFEPadded(1:length(LFE)) = LFE;
LFE = LFEPadded;



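% Sum the stems into one column per ear; the LFE is fed equally to both ears.
Bucket = [L(:,1) + C(:,1) + R(:,1) + Ls(:,1) + Rs(:,1) + LFE, ...
          L(:,2) + C(:,2) + R(:,2) + Ls(:,2) + Rs(:,2) + LFE];

% Normalize to half of full scale using the peak over both ears.
BucketNorma = Bucket/max(abs(Bucket(:)))/2;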



