Three-Dimensional Hearing: The Body, The Brain And The Machine

Romain Boonen, AEDS 911
SAE Institute Brussels
September 2013
TABLE OF CONTENTS

0 FOREWORD
1 THE BODY: FUNDAMENTALS OF THE PHYSICS OF SOUND
1.1. THE PHYSICS OF SOUND: A BRIEF INTRODUCTION
1.1.1. THE NATURE OF SOUND
1.1.2. THE IMPEDANCE OF A MEDIUM
1.1.3. THE FOURIER ANALYSIS
1.1.4. LINEARITY OF A SYSTEM
2 THE BODY: PHYSIOLOGY OF THE EAR
2.1. THE OUTER EAR
2.2. THE MIDDLE EAR
2.3. THE INNER EAR AND THE COCHLEA
2.3.1. ANATOMY OF THE COCHLEA
2.3.1.1. GENERALITIES
2.3.1.2. ESSENTIAL MECHANICS OF THE COCHLEA
2.3.1.3. THE ORGAN OF CORTI
2.3.2. PHYSIOLOGICAL FUNCTIONING OF THE COCHLEA
2.3.3. SCALAE AS FLUID COMPARTMENTS: PERILYMPH AND ENDOLYMPH
3 THE BRAIN: THE CENTRAL AUDITORY NERVOUS SYSTEM AND FOCUS ON HUMAN SOUND SOURCE LOCALIZATION
3.1. ASCENDING PATHWAYS OF THE AUDITORY NERVE
3.1.1. THE AUDITORY NERVE
3.1.2. COCHLEAR NUCLEI
3.1.2.1. THE VENTRAL COCHLEAR NUCLEUS
3.1.2.2. THE DORSAL COCHLEAR NUCLEUS
3.1.3. THE SUPERIOR OLIVARY COMPLEX
3.1.4. THE LATERAL LEMNISCUS
3.1.5. THE INFERIOR COLLICULUS
3.1.6. THE THALAMUS' MEDIAL GENICULATE BODY
3.1.7. THE AUDITORY CORTEX
3.2. SOUND SOURCE LOCALIZATION
3.2.1. THE HORIZONTAL PLANE
3.2.2. THE VERTICAL PLANE
3.2.3. DISTANCE FROM THE SOURCE
4 THE MACHINE: REQUIRED BACKGROUND
4.1. INTRODUCTION TO THE MACHINE
4.2. HEAD-RELATED TRANSFER FUNCTIONS
4.3. CHANNELS VS. OBJECTS
4.4. CONVOLUTION
4.5. INTRODUCTION TO DIGITAL FILTERS
5 THE MACHINE: IMPLEMENTATION STRATEGY
5.1. DESIGN GOALS
5.2. OVERVIEW OF THE CHANNEL-BASED MODEL
5.3. OVERVIEW OF THE OBJECT-BASED MODEL
5.4. CHANNELS VS. OBJECTS: CONCLUSION
0 FOREWORD
When SAE Brussels presented this thesis project and its requirements to their students, the stated goal was for each student to specialize in a specific field: starting from a topic studied in class, they would deepen their knowledge and establish a specialty that could help initiate a career. This project was extremely appealing to me, because I saw in it a great opportunity to gain some level of expertise in a field I had long been attracted to: binaural 3D sound.

As I started to gather basic information about this field of study, it did not take me long to realize that its complexity resided in the fact that it requires at least a certain degree of expertise in several different technical fields, the most important ones being digital signal processing, psychoacoustics, programming, mathematics and physics. At that time, all I had was my (basic) expertise in sound engineering but, motivated as I was, I felt empowered and determined to stick with binaural 3D sound, patiently gather all of the required knowledge and eventually tackle that complexity in order to make it my own field of expertise.
This paper is the result of nine months of dedicated work and is composed of three main parts that I decided to entitle The Body, The Brain and The Machine. The majority of the time spent on this paper was dedicated to learning the skills necessary to create the models presented in the last part, and to interconnecting them. In this context I attended two days of conferences on binaural 3D sound for broadcasters in May 2013 at the EBU headquarters in Geneva, Switzerland. There, Poppy Crum (Dolby) gave a presentation on the neuroscience behind spatial binaural sound, entitled « Neural Sensitivities – HRTF Representations In The Auditory Pathway ». From then on, I grew so intrigued and fascinated by the way humans hear (and, more generally, the way the brain works) that this presentation marked a turning point in my research. I decided to include two large sections dedicated to human hearing physiology and psychoacoustics (respectively entitled "The Body" and "The Brain"), notably meant to provide my readers with the background knowledge necessary to gain insight into the binaural implementation models presented further on.
Indeed, the ultimate purpose of the last part of this work (The Machine) is to present and compare binaural implementation models based on the two current digital sound technologies, namely the channel-based model and the object-based model, and to theoretically assess their respective qualities and drawbacks for binaurally reproducing 3D sound. It is worth mentioning that "The Machine" is the subject of an AES Convention paper that I shall present on October 17, 2013 in New York City.
First of all, I would like to thank our team of helpful supervisors as well as our inspirational teachers (and really the entire SAE Institute Brussels) for giving their students the wonderful opportunity to do their own thing. As far as I am concerned, I know that every day spent at this school has contributed to making me a more inspired individual and, in the end, a better person. What else could I have asked for? I would like to give a special thank you to Robin Reumers for the time he invested in supporting the last sections of this work. I would like to acknowledge his deep understanding of all things audio (and more) as well as his creative technical problem-solving skills, which turned out to be very useful more than once. I would also like to thank my family, my lover Gracie and my six brothers for the support they have provided me throughout these several long months, which actually passed quicker than I ever thought. I feel blessed!
1 THE BODY: FUNDAMENTALS OF THE PHYSICS OF SOUND
1.1. THE PHYSICS OF SOUND: A BRIEF INTRODUCTION
The purpose of this section is to give a brief introduction to some fundamentals of sound physics that will prove quite useful in our understanding of human hearing physiology. Let us dive in…
1.1.1. THE NATURE OF SOUND
What is a sound wave? A sound wave is a vibrational movement of air molecules around their initial positions. It is important to realize that the propagation of sound waves is very different from the wind phenomenon, where molecules flow over large distances!

A sound wave is defined by two main attributes: its frequency and its amplitude (or intensity). The frequency of a sound wave refers to the number of waves passing a specific point in space every second. Frequency, which is commonly specified in Hertz (Hz) or cycles per second (c/s), holds the subjective correlate of pitch when perceived by most living organisms. The amplitude of a sound wave refers to the magnitude of the vibrating movement of an air molecule, and brings about the subjective correlate of loudness.1 Other parameters depend directly upon those two main attributes: a sound wave's frequency defines its period (i.e. the length of time necessary for the wave to execute one full cycle of air pressure compression and decompression) and its wavelength (the distance in meters covered during one period), whereas a sound wave's amplitude dictates the consequent air pressure variation, air velocity and air displacement. The air pressure relates to the level of compression of the air molecules. When speaking of pressure, it is important to clarify that it is the atmospheric pressure of the free field through which the sound wave is travelling that varies. However, a sound wave travelling in a free field has proportionally very little impact on the average atmospheric pressure of this free field; indeed, a level as high as 140 dB SPL makes the general atmospheric pressure vary by only about 0.6%. The air velocity relates to the rate of change of position of the air molecules, whereas air displacement relates to the distance those molecules travel around their equilibrium position.

Sound waves can be described using other parameters as well, and it is possible to easily relate them to each other using simple equations. For example, in the case of a sinusoid, its peak pressure (p) above the mean atmospheric pressure and its velocity (v) relate to each other in the following way:

p = z · v

where z represents the impedance of the medium.

1 Both the pitch and loudness perceptions will be discussed further on in this work, as such phenomena belong to the realm of the brain and not the physical world per se.
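To make the relation concrete, here is a minimal worked sketch (not from the thesis itself): it assumes the standard 20 µPa reference pressure and the air impedance value quoted in the next section, both common textbook figures.

```python
# Worked example of p = z * v for a plane wave in air.
# Assumes air at 20 degrees C (z ~= 413 N.s/m^3) and the standard
# 20 micropascal reference pressure; both are textbook values.

P_REF = 20e-6   # reference RMS pressure in pascals (0 dB SPL)
Z_AIR = 413.0   # characteristic impedance of air, N.s/m^3

def particle_velocity(db_spl: float) -> float:
    """Return RMS particle velocity (m/s) for a plane wave at db_spl."""
    p = P_REF * 10 ** (db_spl / 20)  # RMS pressure in pascals
    return p / Z_AIR                 # rearranged p = z * v

# A 94 dB SPL tone corresponds to ~1 Pa, i.e. v ~= 2.4 mm/s:
print(f"{particle_velocity(94.0) * 1000:.2f} mm/s")
```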
1.1.2. THE IMPEDANCE OF A MEDIUM
The impedance of a medium is an appropriate concept to address at this point because it will prove to be of great importance in the physiology of hearing, in the form of « impedance jumps ». The impedance of a medium can be thought of as its resistance. For example, water has a much higher impedance than air: the pressure required to produce a sound of a given intensity in water is much higher than the pressure required to produce a sound of the same intensity in air, simply because the density of water molecules is much higher than the density of air molecules. It thus requires proportionally more energy to give those water molecules some velocity. In the SI system, the impedance (z) is measured in (N/m²)/(m/s), or N·s/m³. In order to gain some understanding of our water/air example, which will matter later on in this work, let us associate it with some figures… The impedance of air at room temperature (20°C) is about 413 N·s/m³ whereas the impedance of water is about 1.5×10⁶ N·s/m³, which means that the water « resistance » is about 3632 times higher than the air's. When a sound wave propagating in air at 20°C meets a water surface, the change of impedance is such that only a small fraction of the incident wave's intensity is transmitted into the aquatic medium. The result of this impedance jump is that most of the sound wave is reflected off the water surface according to the laws of physics. The following formula allows us to calculate the proportion T of an incident wave propagating in a medium of impedance z1 that will be transmitted into a second medium of impedance z2:

T = 4·z1·z2 / (z1 + z2)²

When plugging our air and water impedance values into the equation, the output translates into a transmission of our incident sound wave into the water medium of only about 0.11%. Converted into decibels, this equals a 30 dB attenuation. In a further section, we will address how the middle ear manages to transmit a signal to the brain despite such attenuation.
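As a quick sanity check, the arithmetic of this section can be restated directly in a short Python sketch (the impedance values are the ones quoted above):

```python
# Intensity transmission across an impedance jump: T = 4*z1*z2 / (z1+z2)**2.

import math

def transmission(z1: float, z2: float) -> float:
    """Fraction of incident intensity transmitted from medium 1 to medium 2."""
    return 4 * z1 * z2 / (z1 + z2) ** 2

Z_AIR = 413.0      # N.s/m^3, air at 20 degrees C
Z_WATER = 1.5e6    # N.s/m^3

t = transmission(Z_AIR, Z_WATER)
print(f"transmitted fraction: {t:.4%}")             # ~0.11%
print(f"attenuation: {10 * math.log10(t):.1f} dB")  # ~ -29.6 dB, i.e. ~30 dB
```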
1.1.3. THE FOURIER ANALYSIS
A further concept to be introduced is the Fourier transform, which will not only help us understand hearing physiology in the scope of the cochlea, but whose implications are so broad that it will be mentioned again in the fifth part of this work ("The Machine: Implementation Strategy"), which concerns signal processing. Joseph Fourier showed that it is possible to decompose a complex signal into a sum of sine waves. That
process is called the Fourier analysis, and it allows us to bridge the gap between a signal's time and frequency domains. When the result is plotted with RMS values on the y-axis and frequencies on the x-axis (as is generally the case), it is easy to see which frequencies compose a complex signal, as well as their respective intensities (see Figure 1).

Figure 1. Decomposition of a square wave ys into a series of sine waves y1, y2, y3, etc. using Fourier analysis. Retrieved from: http://ffden-2.phys.uaf.edu/212_spring2011.web.dir/daniel_randle/ on Sept. 18, 2013.
The analysis of an infinite sine wave represented in the time domain results in a single line in the frequency domain, indicating that wave's frequency and intensity. Very similarly, the analysis of an infinite square wave shows the fundamental with its odd harmonics. However, this model is obviously purely theoretical, as no signal is infinite. What happens with finite signals? The indications of the different frequencies broaden and turn into bands, whose breadths are inversely proportional to the length of the input signal. The longer the signal, the better the resolution we are able to get in the frequency domain.

One may be tempted to ask: « why sine waves? ». For several reasons: sine waves are quite easy to handle mathematically. They also happen to represent the oscillation of quite a few physical systems, and are therefore very present in natural phenomena. But probably the most interesting reason why Fourier analysis proves to be of great importance in hearing physiology is that our ear actually performs this process constantly, although to a limited extent. As mentioned above, this feature will be discussed in the cochlea section. The reverse process (taking many sinusoids and adding them together to form a complex signal) is called synthesis, but it will not be as useful as the Fourier analysis in the scope of this work and will therefore not be addressed.
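As an illustration of the two points above — the odd harmonics of a square wave and the resolution/length trade-off — here is a minimal NumPy sketch; the 100 Hz frequency, sample rate and duration are arbitrary demonstration choices, not values from this thesis:

```python
# Fourier analysis of a finite square wave: the strongest components sit
# at the fundamental and its odd harmonics, and the bin spacing
# (frequency resolution) is 1/duration.

import numpy as np

fs = 8000                       # sample rate, Hz
dur = 1.0                       # seconds; a longer signal -> finer resolution
t = np.arange(0, dur, 1 / fs)
square = np.sign(np.sin(2 * np.pi * 100 * t))  # 100 Hz square wave

spectrum = np.abs(np.fft.rfft(square)) / len(square)
freqs = np.fft.rfftfreq(len(square), 1 / fs)

# Five strongest bins: 100, 300, 500, 700, 900 Hz (odd harmonics).
peaks = freqs[np.argsort(spectrum)[-5:]]
print(np.sort(peaks))
```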
1.1.4. LINEARITY OF A SYSTEM
The last concept to be introduced in this section is linearity. This notion will be useful to describe and properly understand the different stages of the auditory system. A system is referred to as « linear » when it verifies two properties: superposition and homogeneity. A system is non-linear when one of those conditions is not fulfilled. Mathematically, those properties are respectively defined as follows.

The superposition property states that for two different inputs x and y, both belonging to the domain of the function f:

f(x + y) = f(x) + f(y)

Put in plain words, this equation tells us that the result of two or more inputs plugged in at the same time is the same as the sum of the results of the inputs plugged into the system separately.

The homogeneity property states that for any input x in the domain of function f and for any real number k:

f(kx) = k·f(x)

This equation tells us that if the input is scaled by a factor k, the output will be scaled by the same factor k as well.

The fact that a system is linear implies one more important property: the frequencies contained in the output of the system were present in the input signal in the first place! Indeed, a linear system does not generate new frequency components.
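For readers who like to see such definitions in action, the following sketch numerically checks both properties on a simple moving-average filter; the filter is an arbitrary example of a linear system chosen for illustration, not something taken from this thesis:

```python
# Numerical check of superposition and homogeneity on a linear system.

import numpy as np

def system(x: np.ndarray) -> np.ndarray:
    """3-point moving average: a linear, time-invariant system."""
    return np.convolve(x, np.ones(3) / 3, mode="same")

rng = np.random.default_rng(0)
x, y, k = rng.normal(size=256), rng.normal(size=256), 2.5

superposition = np.allclose(system(x + y), system(x) + system(y))
homogeneity = np.allclose(system(k * x), k * system(x))
print(superposition, homogeneity)  # True True for a linear system

# A non-linear system such as f(x) = x**2 fails both checks and, unlike a
# linear system, generates frequency components absent from its input.
```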
2 THE BODY: PHYSIOLOGY OF THE EAR
The auditory system is the sensory system that allows humans to perform the mechanoelectrical transduction of sound waves into neural action potentials. This highly complex system is situated partly outside (the pinna) and partly inside the temporal bone (shown in red in Figure 2).

The human hearing system comprises three main parts: the outer, middle and inner ears. Their anatomies and roles will be investigated in this section. Nonetheless, it is worth noting that the great complexity of the physiological side of human hearing will only allow us to scratch the surface of this most fascinating topic in the scope of this work.

Figure 2. The temporal bone is represented in red. Retrieved from: http://commons.wikimedia.org/wiki/File:Temporal_bone.png in July, 2013.
2.1. THE OUTER EAR
The outer ear consists of a partially cartilaginous shape called the pinna, which comprises a resonant cavity called the concha; the concha forms the entry of the ear canal (also called the meatus), which leads to the tympanic membrane (also referred to as the eardrum). The outer ear fulfills two main roles: it helps localize sound sources and increases the intensity of the incoming sound waves.

The pinna holds a paramount role in this paper because it is one of the main actors in our ability to localize sounds. Indeed, the pinna's shape (which is very individual and can be quite different from one person to another) spectrally modifies the incoming sound waves, giving the brain the cues needed to assess sound sources' positions on the vertical plane. The second important aspect of the pinna is to funnel the waves that reach it into the ear canal. This process increases the intensity of the sound waves reaching the eardrum by about 15 to 20 dB in the 2.5 kHz range (Wiener and Ross, 1946), in the form of resonances produced either by the association of the concha and the meatus (2.5 kHz resonance) or by the concha alone (5.5 kHz resonance).

It is possible to measure the influence of the pinna on the waves coming from a sound source at a known azimuth, elevation and distance. This information, which is extremely valuable in the scope of this paper, is called the Head-Related Transfer Function (HRTF) and will be discussed in further sections.
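To give a first flavour of how HRTFs will be used later in this work, here is a minimal sketch of binaural rendering by convolution; the HRIR arrays are hypothetical placeholders standing in for a real measured pair:

```python
# Sketch of the core idea behind HRTF-based binaural rendering: convolve
# a mono signal with a pair of head-related impulse responses (HRIRs).
# `hrir_left` and `hrir_right` are hypothetical placeholders for a real
# measurement; they are assumed to have equal length.

import numpy as np

def binauralize(mono: np.ndarray,
                hrir_left: np.ndarray,
                hrir_right: np.ndarray) -> np.ndarray:
    """Return an (N, 2) stereo signal carrying the HRIRs' spatial cues."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)
```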
2.2. THE MIDDLE EAR
The middle ear consists of the ossicles (malleus, incus and the stapes, which is also known as the stirrup) and acts as an intermediate step between the eardrum and the cochlea, in the way of an impedance transformer. Indeed, the purpose here is to turn
acoustical energy into mechanical energy. Being attached to the eardrum, the malleus, which is itself attached quite rigidly to the incus, vibrates at the same rate as the tympanic membrane, and the association of those two bones transmits the force to the stapes (about the size of a grain of rice), which is connected to the cochlea's oval window, discussed in the next section. Interestingly enough, those three small bones stop growing very early in a newborn's life, making them the same size as an adult's.

As mentioned in the previous paragraph, the role of the middle ear is to transform the impedance from the large, low-impedance eardrum to the small, high-impedance oval window. Without this middle ear section, the reflections due to the impedance jump would be so high that only a fraction of the incident wave would manage to enter the oval window, and the subsequently perceived level would be much lower. The ossicles thus allow this energy attenuation to be substantially reduced. At this point, it is worth mentioning that the actual functioning of the impedance-transforming process is quite complex and, since a thorough explanation of it would not substantially help the proper understanding of the following sections, it shall remain superficially covered. However, it can be noted that this impedance-transforming
process is supported by two principles. The first one is that, since the stapes' footplate in the oval window is much smaller than the eardrum where the vibrations come from, the energy concentrates in a smaller area, effectively increasing the pressure at the oval window. The actual increase is given by the ratio of the two areas. The second principle, though less prominent, is the lever action of the incus: being smaller than the malleus, the incus increases the force and decreases the velocity transmitted to the stapes.

Figure 3. Cross section of the temporal bone, revealing the main parts involved in the outer, middle and inner ears. Retrieved from: http://www.directhearingaids.co.uk/index.php/33/how-hearing-balance-work-together/ in August, 2013.
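The back-of-the-envelope arithmetic behind those two principles can be written out explicitly. The area and lever values below are commonly quoted textbook figures, assumed here for illustration rather than taken from this thesis:

```python
# Rough middle-ear pressure gain from the two principles above, using
# commonly quoted textbook values (assumed, not from this thesis):
# eardrum effective area ~55 mm^2, stapes footplate ~3.2 mm^2,
# ossicular lever ratio ~1.3.

import math

area_gain = 55.0 / 3.2   # pressure gain from concentrating force on a smaller area
lever_gain = 1.3         # additional force gain from the ossicular lever
total = area_gain * lever_gain

print(f"area ratio: {20 * math.log10(area_gain):.1f} dB")   # ~24.7 dB
print(f"lever:      {20 * math.log10(lever_gain):.1f} dB")  # ~2.3 dB
print(f"combined:   {20 * math.log10(total):.1f} dB")       # ~27 dB
# This roughly offsets the ~30 dB impedance-jump loss computed in 1.1.2.
```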
What about the linearity of transmission of the ossicles? Guinan and Peake (1967) found that the stapes' movement increased proportionally to the input up to 130 dB SPL for frequencies below 2 kHz and up to about 140 to 150 dB SPL for frequencies above. Those results thus seem to point towards linearity of transmission in the ossicles up to those intensities and, although the system of measurement used in that specific research would only have allowed the detection of 10-20% of odd harmonics, there are likely to be no significant harmonics or intermodulation products at lower intensities. However, it is worth mentioning that the suggested linearity of the middle ear may be affected by static pressures applied to the ear. Indeed, such pressures would make the joint connecting the malleus and the incus more rigid and stretch the ligament connecting the stapes to the oval window. Another element also influences the linearity of the middle ear beyond 75 dB SPL: the middle ear muscles.
Two main striated muscles attached to the ossicles act as protection against damage to the inner ear. The tensor tympani is attached to the malleus (on the eardrum's side) whereas the stapedius muscle is attached to the stapes. When sound pressure levels at frequencies below 1-2 kHz become too high, the middle ear muscles contract, increasing the rigidity of movement of the ossicles. However, their action is quite complex and they have been shown to have repercussions in the high frequencies as well. It would therefore be fair to say that humans are equipped with multiband compressors right in their ears… Wever and Vernon (1955) actually showed that this muscle contraction reflex keeps the intensity of the stimulus reaching the cochlea quite constant for low frequencies beyond the reflex threshold (around 75 dB SPL), effectively acting as a multiband brickwall limiter!

Figure 4. Detail of the middle ear. Retrieved from: http://cueflash.com/decks/PHYSIOLOGY_OF_AUDITION_-_54 in August, 2013.
2.3. THE INNER EAR AND THE COCHLEA
After the middle ear comes the inner ear, which is composed of the cochlea and the bony labyrinth, which itself contains the vestibular system. The vestibular system is responsible for the sense of spatial orientation and balance. We shall focus on the cochlea, which is the centerpiece of our auditory system and by far its most complex part. Its intrinsic role is to convert the physical vibrations received from the action of the ossicles into electrical information that the brain can recognize as sound, and its basic understanding will require some chemical and electrical explanations.
2.3.1. ANATOMY OF THE COCHLEA
2.3.1.1. GENERALITIES
Anatomically, the cochlea is a coiled tube separated lengthways into three sections known as the scala vestibuli, the scala media and the scala tympani. Those three scalae spiral together from the base of the cochlea (the larger side) to the apex (the narrower, pointy side), keeping their proportions throughout their turns. The cochlea is about 1 cm wide and 5 mm high. The scala media being smaller than the two outer scalae, the outer scalae share a common separation: an osseous surface called the spiral lamina. This surface is situated close to the modiolus, the spongy bone around which the scalae turn approximately two and a half times. The modiolus contains the spiral ganglion, which shall be mentioned again later on. Reissner's membrane separates the scala vestibuli from the scala media, whereas the basilar membrane divides the scala media from the scala tympani. The basilar membrane notably serves as the surface on which the organ of Corti lies; the organ of Corti contains the auditory transducers called « hair cells ». The scalae contain fluids called the perilymph (outer scalae) and the endolymph (scala media). The two outer scalae meet at the apex of the cochlea in an opening called the helicotrema, allowing the perilymph to connect. The scala media is a closed cavity whose endolymph does not directly interact with the exterior.

Figure 5. Cross section of the cochlea, providing a good view of the three scalae as well as a detailed view of the contents of the scala media. Retrieved from: see image.
2.3.1.2. ESSENTIAL MECHANICS OF THE COCHLEA
When the stapes' vibrations are transmitted to the oval window, they produce a displacement of the fluids within the scala vestibuli, which is transmitted to the scala tympani through the helicotrema. This phenomenon displaces the basilar membrane in a wave-like movement, along with the organ of Corti attached to it in the scala media, effectively allowing the hair cells to be stimulated and to transmit electrical impulses on to the brain.
2.3.1.3. THE ORGAN OF CORTI
The organ of Corti's hair cells amount to about 15,000 in each human ear and come in two kinds: the inner hair cells (IHC), in one row situated on the modiolus side of the cochlea (i.e. toward the inside), and the outer hair cells (OHC), in three to five rows (increasing toward the apex). The hair cells are held in the reticular lamina. From each hair cell stick out the stereocilia2, the part of the hair cell that acts as the initial sensory transducer. Stereocilia are made out of long filaments whose stiffness allows them to stand on the lamina and act as levers in response to mechanical deflections. The longer of the OHCs' stereocilia are embedded in the undersurface of a gelatinous body called the tectorial membrane. Being attached on one side only (toward the modiolus) above the organ of Corti and the basilar membrane, the tectorial membrane creates a deflecting movement of the hair cells' stereocilia according to the movements of the basilar membrane. On inner hair cells, the stereocilia are arranged in three to five nearly straight rows, while on outer hair cells they are arranged in three to five V-shaped rows.
2.3.2. PHYSIOLOGICAL FUNCTIONING OF THE COCHLEA
When sound waves reach the eardrum, its vibrations are transmitted through the ossicles to the oval window. When vibrating, the membrane of the oval window initiates a wave of movement in the cochlear fluids, transmitted up to the round window. This phenomenon causes the cochlear partition (i.e. the basilar membrane and the organ of Corti) to move according to this transmitted wave's position and patterns, effectively revealing the frequency content of the stimulus to the brain once the hair cells are stimulated and the information is sent over to the ascending pathway.

G. von Békésy pioneered cochlear research, and a lot of the current knowledge on this matter is owed to his studies and experiments, described in Békésy, 1960. He analysed the movements of the cochlear partition in human cadavers, was able to plot the travelling wave patterns, and drew conclusions from them. As can be seen from his plots, the amplitude of movement of the cochlear partition is contained within an amplitude envelope, never exceeding it. Some of Békésy's important findings can be summarized as follows:

(1) As we have seen, vibrations from the stapes at any frequency allow a specific travelling wave to be initiated within the cochlear fluids. The travelling wave's pattern and its peak location in the cochlear duct depend on the frequency of the stimulus brought by the stapes.
2 It is important to mention that the specialized literature speaks of the stereocilia both as stereocilia and as hair cell. We can comprehend the intended meaning according to the context, either evoking the whole cell (stereocilia plus the part contained in the reticular lamina) or only the actual stereocilia.
peak regions of the travelling waves of frequencies of 10 kHz and higher. As a side note, it is interesting to note that this feature seems to vanish after death.

Let us go back to this question of the physical limitation of the movement amplitude of the cochlea. In other words, why is it that, physically, the cochlear partition cannot be stimulated in a linear way? It is due to the combined action of two rather simple phenomena that can incidentally be found repeatedly in nature: the stiffness and mass limitations. In order to explain them, let us refer back to the cochlear amplitude envelope described by Békésy. The first phenomenon, referred to as the stiffness limitation, explains why the cochlear partition cannot be fully stimulated from the base up to the resonance point situated closer to the apex. The cochlear partition was actually shown to be relatively rigid near the base, gradually becoming more compliant as it progresses to the apex. This stiffness is the main factor preventing the cochlear partition from moving freely according to the way it is stimulated. The second phenomenon, called the mass limitation, gives the reason why the cochlear partition's amplitude potential rapidly decreases from that resonance point on to the apex: although the cochlear partition is now more compliant than it was at the base, its larger mass and inertia limit its amplitude of movement.
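The base-to-apex gradient described above is often summarized by Greenwood's (1990) place-frequency function for the human cochlea. The short sketch below evaluates it, purely as an illustration from the wider literature rather than a result of this thesis:

```python
# Place-frequency sketch using Greenwood's (1990) human cochlear map:
# f = 165.4 * (10**(2.1 * x) - 0.88), with x the fractional distance
# from the apex (0) to the base (1). It quantifies the base = high
# frequency / apex = low frequency behaviour described above.

def greenwood_hz(x: float) -> float:
    """Characteristic frequency at fractional distance x from the apex."""
    return 165.4 * (10 ** (2.1 * x) - 0.88)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:.2f} -> {greenwood_hz(x):8.0f} Hz")
# ~20 Hz at the apex, ~20.7 kHz at the base
```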
2.3.3. SCALAE AS FLUID COMPARTMENTS: PERILYMPH AND ENDOLYMPH
The two outer scalae, the scala vestibuli and the scala tympani, contain perilymph, whereas the scala media contains endolymph. Chemically, those two extracellular fluids are quite different from each other. Let us give them a closer look.

Contained in the outer scalae, the perilymph is very similar to most other extracellular fluids, being mainly composed of sodium cations (Na+) and, to a much lesser extent, potassium cations (K+). Its electric potential is positive, and was reported by Johnstone and Sellick (1972) to be +7 mV in the scala tympani and +5 mV in the scala vestibuli, i.e. close to ground potential.

The endolymph is contained in the scala media. In contrast to the perilymph, its chemical composition is mainly K+ and, to a lesser extent, Na+. Endolymph is a very unique kind of extracellular fluid, for two reasons: (1) given its general composition, endolymph is very comparable to intracellular fluids, and (2) its very high positive potential — referred to as the endocochlear potential and declining from +100 mV at the base to +83 mV at the apex — has not been found in any other extracellular fluid. Its chemical and electrical uniqueness therefore points at a very specific role played within the cochlea. Indeed, according to the investigations that have been undertaken, the endolymph's mentioned characteristics were found to play important roles in mechanotransduction as well as in the mechanical amplification of the travelling waves propagating in the cochlea.
3 THE BRAIN: THE CENTRAL AUDITORY NERVOUS SYSTEM AND FOCUS ON HUMAN SOUND SOURCE LOCALIZATION
3.1. ASCENDING PATHWAYS OF THE AUDITORY NERVE
Once the hair cells' deflections have produced electrochemical impulses travelling through the auditory nerve fibres in the spiral ganglion, those impulses are sent through several different parallel pathways, which shall be introduced in this section. This way of operating allows the brain to simultaneously extract multiple features of the stimulus, which will prove to be of great importance in creating a representation of the so-called « auditory object ». For example, sound localization on the horizontal plane relies mainly on interaural time and level differences (ITD and ILD respectively), while sound localization on the vertical plane notably requires a complex analysis of the stimuli's spectra. However, a proper analysis of spectral information precludes a reliable analysis of time information, hence the need for parallel pathways of stimulus analysis.

This section is meant to give an idea of those different pathways used by the brain to interpret the electrochemical stimuli produced by the hair cells, as well as to present the important brain areas where the information is treated, mentioning their cell compositions and their purposes. It is well worth mentioning that it is difficult to explain each section's functions individually, because the auditory nervous system is organized hierarchically. Indeed, the information analysed in the lower stages of the process is sent over to higher stages, which analyse the data and only pass on the information that is relevant at the time, in order for the auditory cortex to eventually represent the auditory object we hear. The « resolution of representation » of this auditory object thus increases as the different pieces of information analysed in the lower stages are put together and made sense of in the higher stages.
For reference, a very simplified plan of the different stages of the ascending pathways goes as follows:

(1) After hair cell deflections in the left cochlea, the electrochemical impulses are transmitted to the left cochlear nerve (auditory nerve), situated in the modiolus.

(2) The output fibres of the cochlear nerve branch. One end enters the left ventral cochlear nucleus (VCN), while the second end enters the left dorsal cochlear nucleus (DCN).

(3) The outputs of the left VCN enter both the left and right superior olivary complexes (SOC); this fibre pattern is referred to as the trapezoid body. The left DCN outputs directly enter the right lateral lemniscus nucleus (LLN).

(4) The outputs of the left and right SOC respectively enter the left and right LLN.

(5) From that point on, the left and right parts of the brain no longer connect contralaterally (with the other side of the brain). The left LLN connects with the left inferior colliculus (IC).

(6) The left IC connects to the left medial geniculate body (MGB) in the thalamus.
(7) The left MGB connects with the auditory cortex. Both parts are able to communicate back and forth.
Figure 6. Representation of the ascending pathways of the central auditory nervous system. Retrieved from: http://origin-ars.els-cdn.com/content/image/1-s2.0-S1527336908001347-gr3.jpg in September 2013.

3.1.1. THE AUDITORY NERVE
The auditory nerve is situated in the modiolus, on the inner side of the cochlea. Its afferent fibres are situated at the base of the hair cells, transporting the electrical impulses from the cells to the auditory nerve and then on to the brainstem. The efferent fibres are placed around the same places, and they allow the brainstem to influence the cochlea. Both the efferent and afferent fibres lead to the spiral ganglion in the modiolus.

Inner hair cells and outer hair cells are innervated completely differently. In short, we can say that there are two types of afferent fibres: Type I (also called radial fibres, comprising 90 to 95% of them) and Type II (also referred to as outer spiral fibres, comprising the remainder of the afferent fibres). Each inner hair cell receives about 20 to 30 Type I fibres (according to Liberman et al., 1990) whereas each outer hair cell receives about six Type II fibres. Every Type I fibre is connected to only one hair cell, but Type II fibres branch and end up innervating about ten outer hair cells. However, outer hair cells are not only connected to those few Type II fibres (compared to the innervation of inner hair cells by Type I fibres); they are linked to other synapses coming from different afferent fibres as well.
3.1.2. COCHLEAR NUCLEI
The cochlear nerve sends information to two entities:
3.1.2.1. THE VENTRAL COCHLEAR NUCLEUS
Because of its specialization in the analysis of time and intensity information, the ventral cochlear nucleus contributes mainly to the pathway of binaural localization (on the horizontal plane). Other contributions are made to the pathway of sound identification. The ventral cochlear nucleus is itself composed of two areas:

• The Anteroventral Cochlear Nucleus (AVCN):

The AVCN contains a type of cells called « bushy cells » (named for the bushy patterns of their dendrites), known for their effectiveness in rapidly and reliably transmitting the impulses they receive to the next stage. There are spherical and globular bushy cells.

Spherical bushy cells transmit the information of the stimulus' time of arrival to the superior olivary complex. There, this time information will be compared to the time-of-arrival information coming from the other ear. Globular bushy cells, on the other hand, handle intensity information. Just like the spherical bushy cells, globular bushy cells send this information to the superior olivary complex, where the intensity information from both ears is to be compared.

The AVCN is thus responsible for sending to the higher stages information that will turn out to be very useful in the scope of binaural sound localization in the horizontal plane.

• The Posteroventral Cochlear Nucleus (PVCN):

The PVCN's structure is slightly more complex than the AVCN's, in the sense that it comprises four types of cells: globular bushy cells, octopus cells and two types of stellate cells (T-stellate and D-stellate, in 95% - 5% proportions).
Octopus cells are useful for two main reasons. Firstly, they have a pattern of response called the « onset response », owing to their ability to fire very strongly at the onset of a new stimulus. Secondly, they have an extremely high resolution of response for transients in ongoing stimuli (they can detect more than 500 transients per second!). Moreover, their spectral range of action is very wide. Therefore, it is thought that octopus cells are specialized in the extraction of temporal fluctuations in complex broadband stimuli such as the human voice.

T-stellate cells fire repetitively when they receive stimuli corresponding to a sustained tone burst. However, their firing rate is not related to the frequency of the tone. They send this information to several different areas that are part of this ascending pathway. D-stellate cells shall not be described here.

Summarizing, we can say that the PVCN contributes to two pathways: binaural sound localization (on the horizontal plane specifically) as well as sound identification.
3.1.2.2. THE DORSAL COCHLEAR NUCLEUS
The dorsal cochlear nucleus makes great contributions to the pathways of sound identification as well as binaural sound localization (but on the vertical plane this time). It is composed of three layers, but we shall only focus on the second and most important one, the pyramidal cell layer.

The pyramidal cells (also called fusiform cells) project primarily to the contralateral inferior colliculus (i.e. on the other side of the brain) through the lateral lemniscus nucleus. Unfortunately, studying them is a complex endeavour because of their strong vulnerability to anaesthesia. However, we do know that their response patterns contribute to the sound identification pathway. Since this work focuses on the localization of sound, those responses will not be presented and we shall focus a bit more on the pyramidal cells' contributions to the binaural localization pathway.

It is known that notches in the spectral content of the stimuli can strongly drive the pyramidal cells, if the frequency of the notch is close to the frequency to which the cell is tuned. Those notches are produced by the pinna, and their frequencies are strongly influenced by the elevation angle of the sound source. Evidence for this explanation was found in cats with lesions in those parts of the brainstem, which could no longer make reflex orientations of their heads upwards towards the position of the sound source (Sutherland et al. 1998a, b and 2000). Indeed, it is thought that pyramidal cells play an important role in this unlearned action, as it was still possible for the cats to learn to discriminate between sound sources at different elevations using a behavioural conditioning task. Therefore, the dorsal cochlear nucleus certainly plays a role in the binaural localization of sound sources on the vertical plane, but it cannot be the only actor.
3.1.3. THE SUPERIOR OLIVARY COMPLEX
Two trends can be discriminated among the output streams coming out of the cochlear nuclei: the binaural localization pathway is served by the ventral stream, which is itself divided into one section relaying intensity information and a second one relaying time information, while the identification pathway is served by the dorsal stream.

The dorsal stream is sent directly to the inferior colliculus (through the lateral lemniscus), while the ventral streams enter the superior olivary complexes on both
sides of the brain. The first one, conveying intensity information, enters the lateral superior olive (LSO) along with the same stream coming from the other ear; there, the intensity information conveyed in the streams from both ears will be compared. Much in the same way, the time information from both ears reaches both medial superior olives (MSO), one on each side of the brain, where timing information is compared. The seminal Jeffress model (Jeffress, 1948) suggests an explanation for this correlation process and will be presented in a further section.

The LSO contains cells of the « IE » type. In this terminology, the first letter represents the response to the contralateral ear (I = Inhibitory4) and the second letter represents the response to the ipsilateral one (E = Excitatory5). To exemplify this concept, rough trends can be given as follows. Firstly, the ipsilateral, excitatory ear alone is presented with a tone; the IE cells' firing rate is maximal. Then a tone is introduced to the contralateral, inhibitory ear. As the intensity of that second tone is raised, the firing rate decreases until it reaches a value close to zero, when the contralateral tone's intensity equals the intensity of the ipsilateral one. As will be explained later on in this work, ILDs are mostly relevant for high frequencies. That is the reason why the LSO is mostly reactive to high frequencies.

On the other hand (as previously mentioned), the MSO receives streams coming from the bushy cells of the AVCN on both sides, conveying timing information. Thanks to the spherical bushy cells' ability to fire almost instantly, the nucleus is able to very reliably compare both ears' times of arrival and thus retrieve valuable information for binaural localization on the horizontal plane. How does it work? The MSO has a very thin, sheet-like structure and is composed of a single layer of fusiform cells, most sensitive to low frequencies. Although we will not develop this topic too much, we can simplify and say that the timing information is compared thanks to the fact that each fusiform cell is tuned to fire maximally at a given, characteristic delay between the two times of arrival. The treatment of horizontal-plane localization information in the higher stages of the afferent pathways thus depends upon the quantity of electrical activity fired by each fusiform cell.

It should also be mentioned that a minority of EE cells is contained in the LSO, allowing it to analyze not only intensity information (its specialty) but timing information as well. Similarly, a minority of IE cells is contained in the MSO, allowing it to process not only timing information but also intensity information.
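A toy numerical analogue of this characteristic-delay idea (in the spirit of the Jeffress model mentioned above) is to cross-correlate the two ear signals and pick the lag of maximal coincidence. Everything in the sketch below — signals, sample rate, delay — is synthetic and for illustration only; it is not a physiological model:

```python
# Jeffress-style ITD estimation sketch: each candidate lag plays the role
# of one delay-tuned coincidence detector; the winning lag is the ITD.

import numpy as np

fs = 48000
rng = np.random.default_rng(1)
source = rng.normal(size=2048)

true_itd_samples = 24                  # ~0.5 ms, a plausible lateral ITD
left = source
right = np.roll(source, true_itd_samples)   # right ear signal arrives later

lags = np.arange(-40, 41)              # candidate "characteristic delays"
coincidence = [np.dot(left, np.roll(right, -lag)) for lag in lags]
best = lags[int(np.argmax(coincidence))]
print(f"estimated ITD: {best / fs * 1000:.2f} ms")  # ~0.50 ms
```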
3.1.4. THE LATERAL LEMNISCUS
The lateral lemniscus is a tract through which the ascending pathways run from the superior olivary complex to the inferior colliculus. Two major nuclei are contained within the tract, known as the ventral and dorsal nuclei of the lateral lemniscus (VNLL and DNLL respectively). Although a majority of fibres are connected to one of the nuclei, some of them simply run through the tract, entering the inferior colliculus directly.

The VNLL is part of the monaural sound identification stream6, receiving its inputs from axons of the contralateral ventral cochlear nucleus as well as from other nuclei that are not mentioned in this work for the sake of simplicity. Since it does not

4 Definition of inhibitory: « slow down or prevent (a process, reaction, or function) or reduce the activity of (an enzyme or other agent). »
5 Definition of excitatory: « characterized by, causing, or constituting excitation. »
6 The stream that deals with the identification of sounds retrieved by a single ear, as opposed to binaural information that has previously been interpreted in the superior olivary complex.
receive inputs from either the MSO or the LSO, it does not appear to play any role in the binaural localization pathway. It projects ipsilaterally to the inferior colliculus in a complex pattern. Although experts are still unsure about the actual role of the VNLL, Langner (2005) speculated that it could potentially extract harmonic relations between stimuli.

The DNLL is part of the binaural sound localization pathway, receiving outputs from the ipsilateral MSO, the LSOs from both sides and the contralateral cochlear nucleus. Its role is mainly inhibitory, eventually enhancing the lateralization of sound sources that was created in the lower stages of the ascending pathway. It is interesting to note that, due to the lasting effect of its inhibitory projections (which will not be explained here), it also enhances lateralization by suppressing echoes in echoic environments (Pecka et al., 2007).
3.1.5. THE INFERIOR COLLICULUS
The inferior colliculus can be seen as the most important « data center » of the lower parts of the brainstem. Here, the vast majority of the information previously treated in the different pathways (mainly sound localization, sound identification, and their respective « sub-pathways ») is brought together, and the image of the auditory object that will eventually be perceived in the auditory cortex begins to strongly refine. Of course, this paramount stage of synthesis of basic elements entails a whole new level of complexity. Situated close to the superior colliculus (which is itself the important integrative reflex center of the visual nervous system), the inferior colliculus is composed of three divisions: the central nucleus, the external nucleus and the dorsal cortex. The central nucleus (ICC) is innervated mainly by fibres running through the lateral lemniscus, while the external nucleus and dorsal cortex receive fibres that do not run through the tract. Those two are in charge of treating information surrounding the auditory system, only indirectly improving the eventual auditory object. Instead, this « extra-lemniscal » pathway (as it is referred to) also processes multisensory stimuli.

The ICC receives information from all four sources of binaural localization: the LSO (center of analysis of intensity differences), the MSO (center of analysis of timing differences), the DNLL (responding to both cues) and the DCN (dorsal cochlear nucleus, retrieving the information needed for the proper localization of sounds on the vertical plane).

The ICC is tonotopically organized in laminae (thin layers of organic tissue). Said differently, all of the fibres carrying different information related to a common characteristic frequency meet on the same layer of the ICC. Studies carried out in recent years have suggested some of the interactions of the four sources of binaural localization in the IC. Indeed, Loftus et al. (2004) showed that the low-frequency laminae (where ITDs dominate) receive inputs from the ipsilateral MSO (processing of ITDs) but also, interestingly, from the ipsilateral LSO (processing of ILDs). On the other hand, the high-frequency laminae were shown to receive inputs mainly from the DCN (which, as a reminder, processes the high-frequency notches used in localization on the vertical plane) and, of course, the LSO.

It is worth mentioning that, thanks to recent anatomical evidence, researchers have come to believe in the existence of further maps of information processing (apart from the spectral one). Indeed, the laminae are two-dimensional and the spectral organization covers only one axis. Some have suggested that this second dimension of the laminae is home to a map of periodicity detection, but we have yet to prove
that claim. Another option (that shall be investigated further on) points at a topological map dealing with the phase correlation of stimuli in the process of creating the perceived auditory space.

The external nucleus receives inputs from the contralateral cochlear nucleus (including its DCN), the ICC, the auditory cortex (on a descending pathway), as well as somatosensory7 input from the dorsal columns and the trigeminal nuclei. The dorsal cortex receives information from the contralateral inferior colliculus as well as descending inputs from the auditory cortex. Although we do not know the roles of those two nuclei for sure, many have suggested that the nature of the input received in the external nucleus points at an auditory and somatosensory integrative area that launches the reflexes triggered by certain sounds. This feature of the auditory system is part of the so-called "diffuse" or "extra-lemniscal" system that was previously mentioned.
3.1.6. THE THALAMUS' MEDIAL GENICULATE BODY
The medial geniculate body is the last auditory relay before the stimuli enter theauditory cortex, and, within the scope of descending pathways, acts as an intermediary
between this auditory cortex and the rest of the subcortical nuclei. Moreover, those
ascending and descending connections point at a grouping of the medial geniculate
body and auditory cortex as a functional unit.
Divided in three different units, the medial geniculate body only has one
section that seems to be involved in the lemniscal auditory pathway: the ventral
section. We will not be focusing too much on the other two, less specific areas of the
medial geniculate body. The ventral section mainly collects information from the ICC(just previously seen). Similarily to the ICC, the ventral section of the medial
geniculate body is tonotopically organized in a laminar structure, and it was suggestedthat a further functional organization was underlying the specific range of frequencies
(the functional groups were termed as “slabs”). The purpose of this ventral section issaid to further sharpen frequency resolution.
The other two sections of the medial geniculate body are the medial and dorsal divisions. Both are part of the extra-lemniscal pathway and receive visual as well as somatosensory information; it is worth mentioning that their responses can change as a result of learning.
3.1.7. THE AUDITORY CORTEX
The auditory cortex is the functional unit where all of the previously gathered
information will be assembled in order to form an auditory object in the listener’s
mind. Since the very large complexity of this unit barely allows us to scratch its surface within the scope of this work, the main information here will be covered less precisely and more abstractly.
The auditory cortex consists of a core unit, surrounded by a belt and a para-belt.
The core unit, mainly receiving inputs from the specific, lemniscal system, is itself
composed of three main sections: the primary receiving area (AI), a secondary area (AII) and further "association" areas. Although the information integration processes in the auditory cortex are the same for every human, the actual neural responses will be
7
Definition from Oxford American Dictionaries: relating to or denoting a sensation (such as pressure, pain, or warmth) that can occur anywhere in the body, in contrast to one localized at a sense organ (such
as sight, balance, or taste).
notably a function of the listener’s genetics and previous exposure to this stimulus.
Moreover, more activity is detected for stimuli of current significance to the listener in
his environment. Once the information is analyzed by the core unit, it continues on to the belt and para-belt for further examination.
If we summarize the stimulus' course along the brainstem (from the auditory nerve to the actual representation of the auditory object) in terms of "what" (sound identification) and "where" (sound localization) pathways, we notice that at first those two streams separate right before entering the cochlear nuclei (for better analysis of the involved cues), then progressively reunite in the lemniscal tract up to and including AI.
There, as mentioned, the association of spatial and identity-specific information drives the neural activity in a unique way, effectively representing the auditory object to the listener. Indeed, different objects are represented through different (though overlapping) patterns of neural activity in the auditory cortex. We hear! Coincidences with previously experienced patterns of neural activity facilitate the integration of known stimuli. Again, the "where" and "what" streams segregate into discrete pathways: on one hand, both the identity and localization of the sound are transmitted to a further dorsal pathway in the brain (enabling preparation for a potential consequent motor response), forming a "where" or "do" stream, while a "what" stream continues on a ventral pathway on to several different parts of the brain.
3.2. SOUND SOURCE LOCALIZATION
As we discussed in the preceding section, the process used by the brain to create a sense of auditory space relies on several different features according to the type of signal presented to the ears. Indeed, on one hand, localization on the horizontal plane relies on ITDs and ILDs, effectively using the superior olivary complex's ability to make sense of the interaural correlation. On the other hand, localization on the vertical plane, as well as the judgement of distance to the sound source, mainly relies on the analysis of the spectral content of those sound sources. As a reminder, this information is analyzed in the dorsal cochlear nucleus. Therefore, it would be correct to summarize a little and write that localization on the horizontal plane mainly uses signals in the time domain, while the vertical plane as well as the judgement of distance to the sound source use frequency-domain signals. This section aims at briefly presenting those processes.
3.2.1. THE HORIZONTAL PLANE
Out of the three « dimensions » (horizontal, vertical and distance), localization on the horizontal plane is the best understood. The early findings of Lord Rayleigh in his Duplex Theory (1907) arguably form the core tenet of knowledge in binaural hearing. He was the first to give an explanation for the ILD and ITD phenomena, which contribute to intracranial images assimilated to the « lateralization » of the sound source, i.e. movement of the sound source to the left or right of the listener. ILDs arise because of the physical dimensions of the head relative to an incoming sound: very simply, the high-frequency content of sounds coming from the side contralateral to an ear is reflected by the head, creating an acoustic shadow. This reflection of high frequencies has the effect of diminishing the energy content of the sound reaching the contralateral ear, thus creating a difference of level with the signal reaching the ear on the same side as the sound source. As we now know, those ILDs are correlated in the lateral superior olives.
On the other hand, ITD processing relies on a model that has been held as the reference for over 60 years: the Jeffress model, first introduced in 1948. It aims to explain the manner in which time information in binaural signals is correlated in the bilateral medial superior olives. It consists of an array of neurons serving as coincidence detectors, firing maximally when reached by stimuli from both ears. Those detectors are innervated by axons of variable length that effectively create a system of delay lines, allowing one stimulus to reach all of the detectors but at different times (as can be seen in Figure 7). This arrangement creates a topographic map of ITDs since, when simultaneously reached by the stimuli from both ears, the firing of a given detector corresponds to a given spatial position of the sound
source. Interestingly, Stevens & Newman (1936) reported that human subjects showed the fewest azimuthal sound source localization errors for frequencies below 1.5kHz and above 5kHz, not only indicating that the brain must therefore use two localization mechanisms (respectively ITDs under 1.5kHz and ILDs above 5kHz, thus backing up Rayleigh's work in the process), but also that the confusion reported between 1.5kHz and 5kHz must indicate that those mechanisms act simultaneously within that band. Those results prove quite consistent with physiological reports made later on: the phase-locking of the stimuli in the auditory nerve declines for frequencies above 3kHz and is reduced to practically nothing around 4 or 5kHz, and the medial superior olive that was discussed earlier within the scope of the ITD ascending pathway contains more low-best-frequency neurones, while the lateral superior olive (ILD pathway) contains more high-best-frequency neurones. Speaking of phase-locking: interestingly, neurones in the MSO are actually not sensitive to time differences per se, but rather rely on interaural phase differences between the two ears' inputs (McAlpine, 2005).

As useful and intuitive as it is, the Jeffress model seems to be nothing but a
model. Indeed, the researchers who applied themselves to finding anatomical evidence of the delay-lines concept presented by Jeffress (Smith et al., 1993; Beckius et al., 1999) never found anything convincing enough to validate the model as factual. However, ethological evidence (in the barn owl, whose hearing is incredibly developed and subject to extensive research) has encouraged many to believe that those interaural time correlations are actually the result of topological maps in which specialized neurones are tuned to fire maximally at a given phase (without the delay lines, that is), effectively giving the auditory object its azimuth. It is suggested that such a map would be placed orthogonally or parallel to the tonotopic map that was previously discussed in the section dealing with the central nucleus of the inferior colliculus. However, some observations seem to contradict this claim and it remains an ongoing discussion among experts.
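Since the coincidence-detector picture lends itself naturally to a computational reading, a minimal MATLAB sketch follows. It uses plain cross-correlation as a stand-in for the array of detectors (an illustrative analogue of the model, not Jeffress' actual neural circuit), and every signal and value in it is invented for the example:

```matlab
% Hypothetical sketch: estimating an ITD by interaural cross-correlation,
% the usual computational analogue of Jeffress-style coincidence
% detection. All names and values are invented for illustration.
fs  = 44100;                          % sampling rate in Hz
lag = round(0.0004 * fs);             % simulate an ITD of 0.4 ms
x   = randn(1, 2048);                 % broadband noise burst
left  = x;
right = [zeros(1, lag), x(1:end-lag)];   % same burst, delayed at the far ear

maxLag = round(0.001 * fs);           % physiologically plausible range (+-1 ms)
[c, lags]  = xcorr(right, left, maxLag, 'coeff');
[~, iPeak] = max(c);                  % the "coincidence detector" firing most
estimatedITD = lags(iPeak) / fs;      % positive: right-ear signal lags the left
fprintf('Estimated ITD: %.2f ms\n', estimatedITD * 1e3);
```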
As a side note, it has been stated that the brain relies on the correlation of ILDs and ITDs to assess the lateralization of a sound source. But what is meant by that? Interaural correlation actually refers to how similar or dissimilar the signals of a given sound source reaching the left and right ears are. Two equations are often encountered in specialized literature, both yielding what is referred to as an index of correlation. They are formally known as the normalized covariance and the normalized correlation. Basically, two signals whose analyzed features are perfectly similar hold a correlation index of 1.0. However, it is not my intention to burden this paper with mathematically complex computational models of interaural correlation, so I shall not dig any deeper.
Figure 7. Representation of the model presented by Jeffress (1948). From McAlpine (2005).
Other kinds of computational models for the binaural processing of 3D audio contents will follow soon enough in the Machine chapters.
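For reference only, the normalized correlation just mentioned admits a compact textbook form; assuming a zero-lag comparison of the left- and right-ear signals l[n] and r[n]:

$$\rho_{lr} = \frac{\sum_{n} l[n]\, r[n]}{\sqrt{\sum_{n} l^{2}[n] \sum_{n} r^{2}[n]}}$$

Two identical signals yield ρ = 1.0, the correlation index mentioned above.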
Let us discuss Lord Rayleigh's findings again for a bit. Apart from the idea of the acoustic shadow formed by the head on the side contralateral to the sound source position, which is relevant at high frequencies, Rayleigh also introduced the concept of the « cone of confusion ». The prestigious Oxford Reference website defines this cone of confusion as follows:
« A cone-shaped set of points, radiating outwards from a location
midway between an organism's ears, from which a sound source
produces identical phase delays and transient disparities, making the
use of such binaural cues useless for sound localization. Any cross-
section of the cone represents a set of points that are equidistant from
the left ear and equidistant from the right ear. »8
The cone of confusion is thus a cone that could be drawn around a listener's ear, containing points whose ILD and ITD values are identical, such that the listener can get confused as to the actual position of the sound source.
A related psychoacoustical phenomenon that has left a number of binaural simulation experts wondering is the front-back confusion.9 What does it consist of? Quite simply, front-back confusions consist in the listener's inability to decide whether the sound source emanates from in front of or behind him/her, or, more precisely, in localizing a sound up front when it emanates from behind and vice versa. They are thought to be mostly produced by confusing ITD values for sound sources belonging to the cone of confusion we just discussed. Indeed, for any azimuth up front, the same ITD value exists for a sound source placed at the back. However, the occurrence of such confusions can be greatly diminished when the listener is able to rotate his/her head. Indeed, this new dynamic cue modifies the perceived ITD and ILD values, helping the listener's brain make a more informed decision as to where in space it should place the incoming stimulus. For example, if a sound source is presented at an azimuth of +20° on the center-right of a listener and a front-back confusion occurs whereby the listener localizes the sound source at +160°, then slightly turning his/her head towards the right will decrease the perceived ITD and ILD values, effectively allowing the listener to localize the sound
source at its actual position. However, as suggested in Wallach (1940) and shown in Wightman and Kistler (1999), the actual movement of the head is not necessary for diminishing front-back confusion. For example, if the listener is placed on a rotating platform while receiving stimuli from a static sound source, the listener does not need
8 Definition retrieved from:
http://www.oxfordreference.com/view/10.1093/oi/authority.20110810104643902 on Sept. 17, 2013.
9 It is worth mentioning that such confusions mainly happen in experimental conditions and rarely under « normal » conditions of everyday life. However, it is important to mention them, as they will prove to be of great importance in the binaural virtualization of content discussed in the Machine section.
Figure 8. Representation of Rayleigh's cone of confusion. © 2007 HowStuffWorks.
to rotate his/her head in order to extract the relevant information provided by the dynamic cue, as long as he/she is aware of the direction of his/her relative movement.

The correct positioning of sound sources situated at the rear of the listener's head also depends on another, stationary factor that is only enhanced by the dynamic one we just discussed. I am speaking of the phenomenon occurring when a sound source is situated behind the listener and the high-frequency content is not able to diffract around the pinna, resulting in a form of low-pass filtering. As suggested by Wightman and Kistler (1997a), front-back differences are mostly indicated by level differences in the 4-6kHz region.
3.2.2. THE VERTICAL PLANE
A good starting point in the discussion of localization on the vertical plane would be to look at the results obtained in Butler and Humanski (1991). Listeners were seated by a vertical arch of seven loudspeakers, fixed to the beam from 0° to 90° and positioned in increments of 15°. The testing was organized under six different conditions: in Conditions 1 and 2 the listeners were presented respectively with 3kHz low-pass then high-pass noise bursts originating in the LVP (lateral vertical plane), and they were able to localize sounds binaurally, i.e. using both their ears. In Conditions 3 and 4, the same noises were presented binaurally but this time originating from the MVP (median vertical plane). Conditions 5 and 6 were similar to Conditions 1 and 2, only the listeners' localization abilities were tested monaurally, i.e. using one ear only10. The researchers found that in Condition 1 (when listeners were presented the low-pass noise in the LVP) they were very capable of localizing the sound sources. This result was to be expected given our previous discussion: the listeners relied on the availability of binaural information. However, the listeners performed poorly at assessing the sound sources' elevation in the MVP (Condition 3) with the same low-pass noise. Indeed, no cue was available to relate to that elevation, as the pinnae's filtering abilities that could have provided the necessary information only appear at higher stimulus frequencies (Searle et al., 1975).
On the other hand, when the listeners were presented the high-pass noise bursts (in Conditions 2 and 4) they performed (substantially) better, especially in the MVP. Therefore, it seems clear that localization on the vertical plane depends mostly on the pinna's ability to distort the stimuli's high-frequency content into peaks and notches (mostly between 4kHz and 16kHz; notably see Blauert, 1969) according to the sound sources' elevation. This applies mostly to localization in the MVP, which is, as
10 Naturally, monaural testing makes it possible to isolate the cue related to high-frequency content from the ILDs and ITDs, which are binaural cues.
Figure 9. Apparent elevation of the sound sources plotted against their actual elevations. From Butler and Humanski (1992).
we shall see, a great area of potential improvement in binaural reproduction of 3D
contents.
3.2.3. DISTANCE FROM THE SOURCE
The ability of a listener to estimate distance from a sound source (without visual capture) depends on his/her ability to (mostly unconsciously) determine the way the original signal has changed through the propagation process, according to three main factors: the relative intensity of the sound, the damping effect (i.e. the relative intensity of its high-frequency content) and the direct-to-reverberant energy ratio. This
section holds the purpose of briefly presenting those three important parameters. We
will then discuss the influence of visual capture of the sound source over the estimation
of its distance to the subject, as well as how the accuracy of this estimate directly
relates to the subject’s previous exposure to the perceived auditory object and room
acoustics.
The first factor, the relative intensity of a sound, does not quite come as a surprise, as it is well known that sound waves propagating in free space lose 6dB in sound pressure every time they double their distance from their source. Therefore, judgements of distance increase systematically when the relative sound pressure reaching the eardrums is decreased. This feature points toward a system of internal reference comparing the expected intensity of a given auditory object to the actual occurrence. Placing this occurrence on our internal scale allows us to estimate the distance to the sound source of the given auditory object.
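For reference, this rule can be written as the standard free-field (spherical-spreading) relation for sound pressure level; this is general acoustics rather than a result of any study cited here:

$$L(d) = L(d_0) - 20\log_{10}\!\left(\frac{d}{d_0}\right), \qquad L(2d_0) = L(d_0) - 20\log_{10} 2 \approx L(d_0) - 6\,\text{dB}$$

so that doubling the distance subtracts roughly 6dB of sound pressure level.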
The second factor is closely related to the first one, but is to be considered distinct nonetheless. It deals with the damping effect, i.e. the amount of high-frequency energy that diminishes as a function of distance due to atmospheric absorption. Coleman (1968) supported this notion by showing that a low-pass-filtered signal (with a gentle slope) was consistently localized further away from the subject than the same signal unaltered.

Finally, the third main cue used by listeners to assess the distance from a sound source is the ratio of energies along the direct (i.e. direct field) and indirect (i.e. diffused field) paths to the receiver. This ratio can be called the « direct-to-reverberant » energy ratio. The higher the ratio, the closer the estimated sound source position, and vice versa.
Research has shown that estimations of distance to sound sources using a single modality (either auditory or visual) vary greatly when compared. For example, listeners tend to largely underestimate those distances when the actual position of the sound source is more than about a meter away. When estimating the distance of a sound source, one would expect that combining vision and hearing would always improve the localization. Not so much. Indeed, Gardner (1968) found an effect (which he termed the proximity-image effect) that selects the closest rational visible location as the apparent sound source position, even though it might be meters further away. It is worth mentioning that in Gardner's study this effect was reported under anechoic chamber settings, thus preventing reverberation from bringing further information to the listener, but subsequent research (Mershon et al., 1980) actually found that this proximity-image effect works almost as efficiently in reverberant environments as in anechoic ones, whereas another study (Zahorik, 2001) concluded that this effect is not as definite as previously thought and that methodological differences did not allow scientific conclusions to be drawn from the comparison between the studies. This last study also noted that throughout the experiment listeners seemed to have improved their localization
skills within the given environment, suggesting quite clearly that, as obviously
expected, it is possible for the brain to learn from experiences in order to perform its
tasks more accurately.
4 THE MACHINE: REQUIRED BACKGROUND
4.1. INTRODUCTION TO THE MACHINE
The ultimate purpose of the two "Machine" sections that are about to unfold is to outline two different ways to apprehend the offline binaural conversion of three-dimensional audio contents. The first section serves as an introduction and holds the purpose of presenting several concepts that will prove relevant to this technical endeavour, while the second section will actually present and compare both suggested models. The first model relies on channels while the second one relies on objects.

Because the research was based on the assumption that both models were to reach the same design goals specified in section 5.1, the purpose of the present section is to explore means of reaching said design goals. However, the actual testing of the presented models is not encompassed within the range of this study and will potentially form the object of further work.
The history of sound recording has taught us that ever since the appearance of the first sound-related technologies in the 19th century, the driving force behind this industry's evolution through the decades has been the effort to improve the sense of immersion brought to the listener. In that perspective, a highly non-exhaustive list of seminal technical improvements would include the following technologies: stereophonic sound, first patented by A.D. Blumlein in 1931 (see reference); the general improvement of analog circuits' linearity throughout the 20th century (transducers included); Disney's Fantasia technology "Fantasound" in the 40's; then quadraphonic sound in the early 70's, which paved the way for Dolby® and DTS® 5.1 surround sound systems, largely permitted by the digital technology revolution started in the early 70's, which eventually matured more profoundly in the 2000's. The 2000's saw the democratization of higher sampling rates as well as higher bit-depth resolutions, permitting a more realistic experience of sound. The 2010's are now witnessing the "3D momentum" (with products such as Barco®'s Auro 3D and Dolby®'s Atmos), where the topic has been studied by experts for years though audiences are only starting to get acquainted with the recent, booming technologies.
However, although the commercial success of the new generation of 3D sound products is increasing and a fair number of movie directors report being satisfied with this new creative aspect of moviemaking, the money and space investments needed for consumers to acquire surround (let alone 3D) sound systems in their homes seem to deter them from investing in such immersive experiences. Are 3D sound technologies then confined to the movie theaters? Binaural technologies hold the potential to negate this idea, not only allowing consumers to enjoy 3D sound in the comfort of their homes, but also bringing it to the mobile experience.
4.2. HEAD-RELATED TRANSFER FUNCTIONS
Head-Related Transfer Functions (HRTFs) are a mathematical attempt to isolate, in the form of a filter, the transfer functions containing all of the previously seen cues necessary for the human brain to localize a sound source at a given point in space. They thus comprise both the ITD (encoded in the filter's phase spectrum) and the
IID (encoded in the filter's overall power), as well as the ear's frequency response corresponding to the position where the stimulus was played back relative to the listener or mannequin. The measurement is made by playing a known stimulus through a loudspeaker placed in a free field, whose position is stipulated at a given azimuth (θ), elevation (φ) and distance (Cheng & Wakefield, 2001). The impulse response is generally captured by small microphones placed in the listener's ears.

HRTFs are oftentimes specified to be minimum-phase FIR filters. This
characteristic becomes very useful notably in the case of HRTF interpolation, where an FIR filter can mimic the attributes of an HRTF and is reported to give perceptually acceptable results (Kulkarni et al., 1995). In the case of real-time processing, such practices are of paramount importance in order to be able to output immersive audio, and research in that field has become increasingly important in recent years.
It is well worth mentioning that the quality of the immersive experience is going to depend directly on the quality of the HRTFs. Indeed, since each individual's pinnae and body are unique, personalized HRTFs should always be used. However, although the measurement is quite fast and rather straightforward, not everybody can have their own HRTFs measured, as the procedure notably requires specific gear as well as a calibrated multichannel system. That is the reason why several experts in the field have applied themselves to studying the different physical factors influencing HRTF measurements, in order to gain the understanding needed to build "general" HRTF databases that would eventually allow listeners to choose the HRTFs that best suit them. Indeed, the listener is not required to have his/her own HRTFs in order to have good localization abilities, as some studies have shown that it is possible for humans to adapt to another way of localizing sounds.
4.3. CHANNELS VS OBJECTS
The two current sound technologies are the channel-based model and the object-based one. Their purposes are similar and their outcomes in the immersion realm are relatively close. However, their perspectives are quite different from each other, and their suitability varies according to the application to which they are put. Unsurprisingly, the channel-based model holds channels as its reference. In the recording and/or mixing process, each channel is attributed one signal, which is meant to be reproduced over a speaker placed at the same relative position where it was first intended to be played back when the recorder/mixer approved the content. Therefore, the use of standard speaker layouts has become widely accepted in order to provide a reference system for the audience to enjoy the content the way it was meant to be. The inconvenience of this channel-based technology is that mixes approved in one given speaker layout cannot translate to another one without using up- or down-mixing, thus a priori forcing content providers to mix their materials several times, adding to the production costs (although some mixing techniques can be used to overcome this limitation to a certain extent, which will go unspoken of in the present paper).

The object-based model does not rely on the same channel concept, but rather handles sounds as objects. Each object is assigned a position coordinate on axes X, Y, Z, which varies according to a timecode. The purpose of this system is to be able to automatically rescale the reproduced mix to the available system layout, thus allowing better flexibility. However, the reproduction of object-based spatial audio requires the use of decoders in order to render the sounds correspondingly to the current system setup. This characteristic of the object-based model can be problematic in some cases, as it raises the question of the absence of a true referential "master".
4.4. CONVOLUTION
The most common way to translate spatial audio over headphones using HRTFs requires the use of convolution. Let us process a monaural signal x[i] so that a listener could localize it at a given azimuth θ and elevation φ. The result of the processing yields yl[i] and yr[i], which are to be played back simultaneously on a pair of headphones, respectively by the left and right membranes. The HRTF database used in this research is the ARI database [6], which contains 256-sample-long HRTFs. Since
HRTFs are often referred to as minimum-phase FIR filters, let $d^{\min}_{l,\theta,\varphi}$ and $d^{\min}_{r,\theta,\varphi}$ be the minimum-phase impulse responses measured at azimuth θ and elevation φ. The output samples are then given by the convolution sums

$$y_l[i] = \sum_{j=0}^{M-1} d^{\min}_{l,\theta,\varphi}[j]\, x[i-j], \qquad y_r[i] = \sum_{j=0}^{M-1} d^{\min}_{r,\theta,\varphi}[j]\, x[i-j]$$

where M = 256 is the filter length.
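As a minimal MATLAB sketch of these two sums (the decaying random vectors below merely stand in for real measured HRIRs; every name and value is invented for illustration):

```matlab
% Hypothetical sketch of the binaural convolution written out above.
% hrirL/hrirR stand for the 256-sample minimum-phase impulse responses
% at the chosen azimuth/elevation; x is the monaural input signal.
fs    = 44100;
x     = randn(1, fs);                 % one second of noise as input
decay = exp(-(0:255) / 32);           % crude decaying envelope
hrirL = randn(1, 256) .* decay;       % placeholders, NOT real
hrirR = randn(1, 256) .* decay;       % ARI measurements

yL = conv(x, hrirL);                  % each output is
yR = conv(x, hrirR);                  % numel(x) + 256 - 1 samples long
y  = [yL(:), yR(:)];                  % stereo pair, left/right
% sound(y / max(abs(y(:))), fs);      % headphone playback check
```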
Great… But can we explain what convolution is in plain words? Just like
addition is the mathematical operation of combining two numbers into a third one,
convolution is the operation that combines two signals (the input signal and the impulse response) into a third one (the output signal). Its symbol is the star *,
which should not be confused with the multiplication symbol used in computer
programs! Practically, the expression: “x[n] * h[n] = y[n]” could be translated into:
“signal x[n] is convolved with impulse response h[n] resulting in output y[n].” At this
point, it is already worth mentioning that convolution is commutative, i.e. x[n] * h[n] = h[n] * x[n] = y[n].
An impulse is a signal whose points are all zeros, besides one. The delta function, written δ[n], is a normalized impulse, i.e. its only nonzero sample is situated at index zero and has a value of one. When the delta function enters a given linear system, the output is called an impulse response, h[n]. However, any impulse can be expressed as a shifted and scaled version of the delta function; for instance, let us consider signal d[n], composed of a sample that has a value of -2 at index n=4, and whose other samples are all zeros. Signal d[n] is thus a delta function, only shifted to the right by 4 samples and multiplied by -2. Therefore, d[n] = -2δ[n-4]. We say that an impulse response is the definition of a convolution system because, when its identity is known, we know how any signal is going to react when passed through the system. Actually, the impulse response is the system. Convolution being a paramount building block of digital signal processing, it is worth noting that the term used to refer to the impulse response of a system can vary according to the application. Indeed, it is called a point-spread function in the field of image processing, or a kernel if the considered system is a filter. As previously mentioned, our HRTFs are considered to be filters to their input files; therefore, kernel is the right term to use in our case.
In most practical cases, the input files of a convolution are several thousand samples long, while the impulse responses are usually much shorter. In our case, the input files are going to be the audio signals, while our kernels will be the HRTFs, which are, as mentioned, 256 samples long. The output files will be the same audio signals, but spatialized so the brain is able to localize them at the intended spot in space. The number of samples contained in those output files will prove to be of great importance in the models presented in further sections. Fortunately, the formula used to calculate this number is very simple: the number of samples in the
output file equals the number of samples in the input file, plus the number of samples contained in the kernel, minus one.
While convolution can certainly be approached from several different perspectives, the short introduction presented here will merely allow one. The point of view that shall be focused on is called the "input side algorithm" and will teach us how the input signal contributes to the making of the output signal. Although the input side algorithm perspective does not provide a deep mathematical understanding of convolution, it does allow us to gain some conceptual insight into the process of convolution, which is exactly what we are aiming for in this section.
Let us use a simple example of convolution for a 9-point input signal x[n] and a 4-point impulse response h[n].11 The input signal can be decomposed into discrete samples that can then be considered shifted and scaled versions of a delta function. Therefore, when looking at sample x[2] ("the sample situated at index 2 in x"), which has a value of two, we see that it can be expressed as 2δ[n-2] because it corresponds to a delta function multiplied by 2 and shifted two indexes to the right. After passing through the system, this component of x called x[2] becomes 2h[n-2]. We can visually verify this concept in the second box of Figure 10, where the little diamonds serve as "place holders" in each box and are just added zeros, while the squares represent the actual contributions from each point of the input signal x[n].
Very briefly, the input side algorithm works as follows (see Figures 10 and 11): once the vectors are placed into their respective arrays (x[] for the input file, h[] for the impulse response) and the usual programming practices are taken care of in the script (notably zeroing the output array y[], because it serves as an accumulator and the variable therefore needs to be reinitialized before each execution), two for loops are initiated. The first loop goes through every single index of x[] to individually look at all of the input signal's samples. For each of them (still associated with modified delta functions), a second, inner loop calculates a shifted and scaled version of the impulse response contained in h[]. Each result is then added to the output array y[]. A sketch of this algorithm is given below.
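A minimal MATLAB transcription of this double loop could look as follows; the toy values of x and h are invented for illustration and are not Smith's exact figures:

```matlab
% Minimal sketch of the input-side algorithm described above
% (after Smith, 1997); x and h are illustrative toy signals.
x = [0 -1 -1.2 2 1.4 1.4 0.6 0 -0.4];   % 9-point input signal
h = [1 -0.5 -0.25 -0.1];                % 4-point impulse response
y = zeros(1, numel(x) + numel(h) - 1);  % accumulator, zeroed first

for i = 1:numel(x)          % outer loop: each input sample...
    for j = 1:numel(h)      % inner loop: ...adds a shifted, scaled
        y(i+j-1) = y(i+j-1) + x(i) * h(j);  % copy of h into the output
    end
end
% isequal(y, conv(x, h)) confirms the two loops implement exactly
% the convolution sum.
```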
11 The schemes used in this example were taken from the excellent The Scientist and Engineer’s Guide
to Signal Processing written by Steven W. Smith (1997)
Figure 10. Representation of the convolution between signal x[n] and impulse response h[n] yielding signal y[n]. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.
Figure 11. Representation of the input-side algorithm. Taken from "Digital Signal Processing - A
Practical Guide for Engineers and Scientists" written by Steven Smith.
4.5. INTRODUCTION TO DIGITAL FILTERS
This section serves the purpose of giving a very short introduction to the very large topic of digital filters. The goal is to convey some insight into the way our filter kernels (HRTFs) are going to interact with our input signal.

Every filter is characterized by three main attributes: the impulse response (i.e. its filter kernel), the step response and the frequency response. It is worth noting that all three of those attributes actually represent the same information, only described from different perspectives. Indeed, it is no problem to convert the information found in the impulse response to obtain the step response or the frequency response: integration12 of the impulse response allows us to find the step response, whereas taking the DFT (by means of the FFT algorithm) of this impulse response yields the filter's frequency response.
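A minimal MATLAB sketch of these conversions, assuming the Signal Processing Toolbox for the windowed-sinc helpers (the 64-tap kernel below is an invented example, not one of our HRTFs):

```matlab
% Sketch of the three equivalent descriptions of a filter: impulse
% response, step response and frequency response.
h = sinc(-31.5:31.5) .* hamming(64)';   % windowed-sinc low-pass kernel

stepResponse = cumsum(h);               % running sum of the impulse response
H            = fft(h, 512);             % DFT (via the FFT algorithm)...
magnitude    = abs(H(1:257));           % ...gives the frequency response

plot(20*log10(magnitude + eps));        % magnitude in dB against DFT bin
```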
The realm of filtering is one of decisions; indeed, there is no such thing as a perfect filter. Therefore, the filter's characteristics are to be adapted according to its function. Much in the same way as in the central auditory nervous system, where the stimuli from the cochlear nerve were split into two distinct pathways respectively handling time and frequency information (for the good reason that such a system was necessary in order to preserve the features of the stimuli relevant to each stream), filters are not able to perform well in both the time and frequency domains at the same time. Therefore, the step response can be focused on if the application requires high time-domain resolution, while the frequency response can be improved if the filter is to be
12 Or, to be mathematically correct, « taking the running sum ». Indeed, integration is an operation that applies solely to continuous signals, whereas the running sum is the appropriate term when dealing with discrete signals.
used in an application demanding a good frequency resolution. Let us take a closer
look at their respective parameters.
First of all, what is the step response? And before that, what is a step function? In order to answer this second question, it can be useful to note how we, humans, interpret signals: our brain is capable of dividing stimuli into regions of similar characteristics (such as noise, then high amplitudes, then low amplitudes, etc.) by identifying the turning points between those regions, i.e. the points that separate them.
Figure 12. Representations of the good and poor characteristics of a filter designed for a time-domain application. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.
That is exactly what step functions are: turning points between zones of similar characteristics. The step response (which can also be found by taking the running sum of the impulse response) results from feeding a step function into a given system, i.e. in our case, a filter. Basically, the step response will show us how the step function was affected by the filter. The step response is characterized by three main parameters: risetime, overshoot and linearity. In order to design a filter for use in the time domain, the risetime needs to be shorter than the spacing of the events, in order to provide good resolution. The step response should not overshoot, because overshoot distorts the amplitudes of samples in the signal. Lastly, the linearity of the filter is determined by whether the upper half of the step response is a point reflection of its lower half (see Figure 12).
Figure 13. Representations of the good and poor characteristics of a filter designed for a
frequency-domain application. Taken from "Digital Signal Processing - A Practical Guide for
Engineers and Scientists" written by Steven Smith.
Filters used for applications in the frequency domain have three main parameters: roll-off, passband ripple and stopband attenuation. A fast roll-off allows a narrow transition band13; there should be no passband ripple, in order for the signal we want to keep not to be affected by the filtering; and the stopband attenuation should be maximal.
Now that we can look at digital filters a little more clearly, we can examine the two possible ways to filter an input signal: convolution and another process called recursion. Convolution creates FIR (Finite Impulse Response) filters, while recursion makes for IIR (Infinite Impulse Response) filters. In theory, FIR filters are fantastic in our case because they have the great feature of not messing around with the phase of our input signal, which will prove to be of paramount importance in further sections.
13 A transition band is the band situated between the pass-band and the stop-band, i.e. the band it takes to go from -3dB of the pass-band to the stop-band. It is worth noting that while this claim is correct in the analog realm, transition bands in the digital realm were never really standardized and are often stipulated as percentages (99%, 70.7% (which equals -3dB), 50%, etc.).
5 THE MACHINE: IMPLEMENTATION STRATEGY
5.1. DESIGN GOALS
The goal of the converting algorithm is to output stereo, spatialized files that are as immersive as possible for the listener. Although the script is offline and the computing time is not of paramount importance, the algorithm must be written efficiently enough to allow future changes for possible real-time adaptation. Human psychoacoustic considerations being of extreme importance for the immersive quality of the output files, they must be placed at the center of the design goals. Practically, those physiological considerations translate into the following propositions:
(1) The converter must provide as many virtual sound source positions as possible. The MAA (Minimum Audible Angle) is the basic metric of the listener's relative localization ability, and is thought of as the smallest angle detectable by humans in azimuth or elevation for a sound source (Letowski & Letowski, 2012). Therefore, the MAA is a good indicator of the resolution of the auditory localization system. On the azimuthal plane, humans showed the ability to discriminate changes of only 1° or 2° in the frontal position when wide-band stimuli and low-frequency tones were played (Grothe et al., 2010). Those values were reported to increase to 8-10° at 90° and decrease again to 6-7° at the rear (Letowski & Letowski, 2012). The MAA reported on the elevation plane is about 3-9° in the frontal hemisphere, and almost twice as large in the rear hemisphere (at 60° in elevation) (Letowski & Letowski, 2012). However, it is well worth mentioning that the MAA does not quantify absolute localization judgements, but only relative ones. The reported measurements were much larger for the average error in absolute localization of a broadband source: 5° for the frontal and about 20° for the lateral position (Hofman & Van Opstal, 1998). This valuable information can be useful in estimating the relevance of the ARI HRTF database used in the present research: the HRTFs were measured in incremental steps of 2.5° in the azimuthal range of ±45° and of 5° outside this range, while elevation was measured in increments of 5°. From these facts, we can draw the conclusion that the ARI database has good enough resolution for HRTF interpolation not to be considered in the case of the present research.
(2) The converter's time reference must be short enough to provide good resolution in the sound sources' movement. Humans' ability to perceive sound motion operates through a series of cues, the main ones being the radial and angular velocities (Letowski & Letowski, 2012). The radial velocity is that at which sound sources move towards or away from the listener, directly affecting the sound intensity as well as inducing Doppler shifts in sound frequency. On the other hand, the angular velocity represents the velocity at which sounds rotate around the listener, and is perceived through monaural and binaural localization cues. Although the radial velocity has little impact in the present research because the ARI HRTF database does not include ear-source distance variations, the angular velocity turns out to be very
not include ear-source distance variations, the angular velocity turns out to be very
useful information to work with. The MAMA (Minimum Audible Movement Angle),
that is the primary metric used in reporting perceived sound source motion, is defined
as the smallest angular distance the sound source has to travel, so that its direction of
motion is detected. It could therefore be thought of as the detection threshold for
movement. The MAMA is the smallest in the listener's frontal plane and increases asthe sound source moves away to the sides of the head. Indeed, a minimum duration of
150–200 ms in the 0°–60° range of observation angles was reported, and the durations increased by ~25%–30% at larger angles for sound sources moving at low velocities. 150ms thus seemingly being the shortest time for a human to perceive sound source motion, the time reference for the models was chosen to be below that value, namely 100ms, so that excellent movement resolution could be obtained.
(3) The converter must induce minimal phase artifacts.
5.2. OVERVIEW OF THE CHANNEL-BASED MODEL
The converter is fed audio files in the form of channels. The simplest form of
algorithm would be to convolve those channels with the HRTFs corresponding to the
physical positions of the speakers in the required layout, effectively creating static,
"virtual speakers". However, the sense of immersion induced by this technique is
limited because of the very few sound source positions available. In order to meet the
first requirement of our design goals, two specific categories of sounds are to be
distinguished: static sounds and sounds evolving in space. The implementation of the
following process is suggested in order for the program to distinguish between both
categories. Nonetheless, it is worth noting that the presentation of such a process does not aim at providing an exhaustive and precise implementation strategy for the channel-based algorithm, but rather is used as a means of recognizing the type of processing required for the binaural auralization of 3D channel-based content.
First of all, a timecode is applied to the audio content in order to provide it with a time reference system. As justified in section 5.1, the time reference is the tenth of a second. The signals contained in every channel are then analyzed to reveal their frequency-domain content, in order to know the energy contained in each of their sub-bands. However, because the signals dealt with in the present application are usually non-stationary, the proposed method requires the use of wavelet transforms as opposed to the traditional Fourier transform, which is not suitable in this case because of its lack of precision in revealing a non-stationary signal's temporal structure. Through the scaling and time-shifting of the mother wavelet function, the input signal can effectively be analyzed and reveal its spectral content as intended.
The next step in the process leads to a complex comparison system of all of the channels' sub-bands' RMS values, with a windowing time of 100ms. The purpose of such a system is to monitor the spectral activity of the channels in order to draw conclusions about the spatial evolution of sounds from one channel to another. Each channel's sub-bands are compared to the corresponding sub-bands of the channels played back on speakers whose physical positions are adjacent to the analyzed channel, 100ms later. Let us exemplify this idea using Barco®'s Auro 11.1 speaker layout (L, C, R, Ls, Rs, HL, HC, HR, HLs, HRs, VoG and LFE), with the analysis of the Right channel's sub-band centered on 1kHz. This Right channel's sub-band's RMS value at time 00:00:00:10 will be compared to the RMS values of the same sub-bands (i.e. centered on 1kHz) of the C, HC, HR, HRs and Rs channels at time 00:00:00:20. For simplicity's sake, let us continue expressing the present idea with only one of R's adjacent channels, namely C. When comparing the RMS value of R's sub-band centered around 1kHz with C's, three different outcomes can be expected: the RMS value of R's 1kHz sub-band at 00:00:00:10 can be either greater than, equal to or smaller than C's at 00:00:00:20. Depending on the outcome, logical conclusions can be drawn: if R's sub-band is quieter than C's, chances are that whatever sound contains energy at 1kHz is evolving from the right to the
center. Inversely, if R's sub-band is louder than C's, the sound is evolving from the center speaker to the right one. If the energy contained in R's sub-band is the same as C's, the sound scene is likely to be static at that time. However, this form of spectral tracking is far from infallible: for example, it could not track sounds whose spectral content evolves along with their positions in space. A simplified sketch of this comparison is given below.
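The sketch below, assuming the Signal Processing Toolbox (fir1), substitutes a fixed band-pass FIR for the wavelet analysis proposed above; the channel signals and all names are invented for illustration:

```matlab
% Simplified sketch of the adjacent-channel comparison described above.
fs  = 48000;
win = round(0.100 * fs);                  % 100 ms analysis window
bp  = fir1(256, [900 1100] / (fs/2));     % FIR band-pass around 1 kHz

R = filter(bp, 1, randn(1, fs));          % stand-ins for the R and C
C = filter(bp, 1, randn(1, fs));          % channel signals (1 s each)

k    = 3;                                 % compare window k of R with
segR = R((k-1)*win+1 : k*win);            % window k+1 of C, i.e. the
segC = C( k   *win+1 : (k+1)*win);        % same band 100 ms later
rmsR = sqrt(mean(segR.^2));
rmsC = sqrt(mean(segC.^2));

if     rmsR < rmsC, disp('energy in this band moving R -> C');
elseif rmsR > rmsC, disp('energy in this band moving C -> R');
else,               disp('sound scene static in this band');
end
```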
Channels are then separated into sub-bands using band-pass FIR filters, and each channel's sub-bands are convolved with the HRTFs corresponding to the positions retrieved from the spectral-content analysis explained in the previous paragraph; static sound sources are convolved with the HRTFs corresponding to the virtual speaker positions, whereas sound sources evolving in space are convolved with HRTFs corresponding to intermediate positions between those virtual speakers. However, this method of binaural spatialization yields phase distortion of the channels' signals, which goes against proposition (3) of the design goals.
This brief explanation allows one to start realizing the complex development procedures required to binaurally translate 3D channel-based audio contents while aiming to reach the design goal expressed in 5.1, stating that "the converter must provide as many virtual sound source positions as possible". Indeed, in order to do so, a system of wave-pattern detection coupled with minimum-phase FIR filters (whose qualitative performance can be very high if properly designed, but which would subsequently show poor computational efficiency) should allow the script to perfectly "crop" every single element of the audio content in order to individually convolve each with the HRTFs corresponding to its position. Although nowadays such a procedure would be impossible to achieve with perfect results, it would effectively turn channel-based contents into object-based ones.
5.3. OVERVIEW OF THE OBJECT-BASED MODEL
Sounds are considered to be objects, each with its own set of spatial coordinates in regard to the time reference. Since the HRTF measurements from the ARI database only include the direction of the incoming signal and not the ear-source distance (like most of the available HRTF databases), only two axes are relevant in our position coordinate system: azimuth (θ) and elevation (φ). The azimuth parameter has increments of 2.5° from -45° to +45° and of 5° for the rest of the sphere. The elevation parameter has increments of 5° throughout. As explained in 5.1, the time reference is the tenth of a second in order to provide good locational accuracy when the need arises to process objects that evolve quickly in space.
The spatial coordinates and the time reference (timecode) for each object are stored in a .txt file. The purpose of the program is to read the object's own .txt file to use its coordinates in regard to the timecode, associate those coordinates with the corresponding HRTF, and convolve this HRTF with the object. For best efficiency, a function handling the coordinates/HRTF association can easily be built into the program so it does not have to be recreated during each execution. An illustrative example of such a coordinate file is given below.
During each 100ms window, a number of samples from the object are convolved with the HRTF corresponding to their coordinates. The number of samples depends directly on the sampling rate of the .wav object. For example, an object whose sampling rate is 44100Hz is segmented into pieces every 4410 samples. Those 4410 samples are then convolved with the HRTF corresponding to their coordinates at that point in time. All of ARI's HRTFs being 256 samples long, according to the basic convolution rules the number of samples resulting from this single operation
will thus amount to 4410 + (256-1) = 4665 samples (which will be referred to as "window convolution output samples"). When these window convolution output samples are placed side by side into a vector, a system of overlap ensures that the total amount of samples does not increase as a side effect of the convolution process, and also avoids disregarding the valuable information contained in the tails of the window convolution outputs. In practice, still in the case of an input signal originally sampled at 44,100Hz, the first 4665 samples resulting from the first convolution are laid into a vector, but the second "load" of 4665 samples starts 4410 samples in (at index 4410, counting from zero), adding itself to the remaining 255 tail samples of the first window's convolution output. This operation goes on until all of the object's samples are processed. A sketch of this overlap-add scheme is given below.
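A minimal MATLAB sketch of this segmentation and overlap-add scheme, for one ear of one object; the decaying random vector stands in for the HRTF selected from the object's coordinates, and all names are invented for illustration:

```matlab
% Sketch of the 100 ms segmentation with overlap-add described above,
% for one ear of one object; run once per ear with the matching HRTF
% of the pair. All names and signals are invented for illustration.
fs  = 44100;
win = fs / 10;                         % 4410 samples per 100 ms window
M   = 256;                             % ARI HRTF length in samples
x   = randn(1, 5 * win);               % half a second of object audio
y   = zeros(1, numel(x) + M - 1);      % output accumulator, zeroed first

for k = 1:numel(x) / win
    seg  = x((k-1)*win+1 : k*win);     % the k-th 4410-sample segment
    hrir = randn(1, M) .* exp(-(0:M-1)/32);  % stand-in for the HRTF
                                             % valid during this window
    yk   = conv(seg, hrir);            % 4410 + 256 - 1 = 4665 samples
    idx  = (k-1)*win+1 : (k-1)*win + numel(yk);
    y(idx) = y(idx) + yk;              % overlap-add: each 255-sample tail
end                                    % sums into the next window's head
```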
As explained earlier, HRTFs work in pairs. Therefore, for each object, the aforementioned process needs to be carried out twice: once for each ear's transfer function at any given position.
Once the convolution process is finished for all of the objects, their vectors can
then all be padded with 0’s in order for them to have the same vectorial dimensions.
After padding, two variables that will be referred to as “Left Ear Bucket” and “Right
Ear Bucket" are initialized. Those so-called buckets will respectively gather all the values stored in the vectors affected by the "left ear HRTFs" (in the Left Ear Bucket)
and by the “right ear HRTFs” (in the Right Ear Bucket).
The last step concerns normalization. In this case, the EBU R128 standard was chosen in order to normalize signals to an appropriate level, and it is implemented through the use of C++ libraries that have been made available.
Independently of these considerations, a good question to raise is one that addresses the capability of the object-based model to reproduce diffused-field audio. Given the fact that, by definition, one object can only hold a maximum of one position at a given point in time, it would be impossible to reproduce a soundscape recorded with a microphone array using objects solely. Such sounds belong to the channel-based domain and are commonly referred to as "beds". They require the use of up- or down-mixing in order to be adapted to the number of speakers available in the object-based reproduction system used.
5.4. CHANNELS VS. OBJECTS: CONCLUSION
Although the perspectives of the channel- and object-based implementation models presented in this paper are very different from each other, they actually complement each other nicely. Indeed, as we have seen, the channel-based model is well indicated for sounds containing diffused-field content, since such sounds surround the listener and thus originate from several positions. However, this model does not allow the convenient binaural localization of sound sources evolving in space. Those types of sounds are best handled by the object-based model, which makes it easy to associate the individual sound source positions with the corresponding HRTFs. The creation of a hybrid algorithm combining the respective strengths of both presented models would be promising and hold much potential for the field of binaural conversion of 3D contents.

As a side note, another conclusion can be reached: in order to obtain the best
possible output quality, one necessarily has to conceive a real-time algorithm. Indeed, as we have seen, good localization relies on the listener's ability to rotate his/her head in order for his/her brain to improve the auditory object's localization. Thanks to the several different face-tracking systems available (webcams, optical, electromagnetic, etc.), the listener is able to rotate his/her head while the system
interprets those rotations and maintains the soundstage up front. However, for best results the latency should be minimal, and a system of HRTF interpolation should be set up in order to improve the resolution of auditory objects' movements.
6 AFTERWORD
This work served the objective of introducing three-dimensional hearing from three different perspectives, namely the body's, the brain's and the machine's. Although it is obvious that entire books could be written about every little section of this paper, my intention was to provide my readers with a glimpse into several very different concepts belonging to fields of study that usually do not blend with each other. However, it is my firm belief that that is where the future resides: nowadays the complexity of knowledge can be such that researchers become extremely specialized within a single area of their field, therefore sometimes losing the general view. I believe that getting interested in several fields of study is a good way to stay inspired and to keep this notion of ensemble, because in the end everything is inter-related. How fascinating!
Let us review what we have gone through. Starting off in Chapter 1 with some physics fundamentals required for the proper understanding of the second one, we notably went over the defining concepts of a sound wave and its propagation in air before moving on to discussing the impedance of a medium. The Fourier analysis was then briefly introduced before a word on the mathematical definitions of linearity.
Chapter 2 started directing us more toward the subject of this paper. Indeed, the Body section discussed the outer ear, including its composition and roles, with a focus on the important part played by the pinna. We then discussed the middle ear and its ossicles, paramount in the impedance transformation process, a concept introduced in the first chapter. We also went over the action of the middle-ear muscles, which reduce the intensity of the sound wave as it enters the oval window. Then, the inner ear and the cochlea were introduced. We went over the inner ear's composition while focusing on the cochlea's important role. We saw its anatomy and mechanics, and discussed its organ of Corti, which contains the hair cells and stereocilia responsible for the electrochemical translation of mechanical phenomena. Still within the cochlea, we saw its physiological functioning with a view on von Békésy's work. The last section of this second chapter discussed the fluids contained in the cochlea's scalae, the perilymph and endolymph, whose very special composition hints at a very special role.
Chapter 3 started where Chapter 2 left us: with electrochemical impulses. Welcome to the brain's domain! We learnt about the different pathways it uses to preserve the precious information, which needs to be refined through several different stages before finally being summarized into a so-called auditory object. We hear! A short description of the ascending pathways is provided on pp. 17 and 18, with a very useful figure that helps to visualize our introductory voyage through the central auditory nervous system. A second section within the third chapter allowed us to gain some insight into the way we, humans, localize sound sources in the horizontal plane (mainly relying on ITDs and ILDs) and in the vertical plane (mainly relying on spectral cues). We also briefly discussed the cues involved in our ability to estimate the distance to a sound source.
Chapter 4 initiates a shift in perspective on the three-dimensional hearing topic, and aims at introducing several concepts whose understanding proves useful in the discussions provided in Chapter 5. We introduced the concept of the Machine,
which is an attempt to present and compare the implementation strategies of two algorithms (channel- and object-based) for the offline binaural conversion of 3D audio contents. We then went over a historical introduction explaining why binaural conversion is a solution to a situation we face in today's world, namely the fact that multi-channel sound systems seem destined to remain in movie theaters, since the cost and space they require are prohibitive for many consumers. We subsequently moved on to Chapter 4's main purpose and went over the concept of HRTFs, compared channels and objects, then introduced convolution and digital filters.
Chapter 5 brings us to the heart of the matter, with three propositions that constitute the algorithms' common design goals, relying on recent research in psychoacoustics. Section 5.2 overviews the channel-based model, while section 5.3 handles the object-based one, before section 5.4 concludes on which model is best suited to which application.
Chapter 6 is the current chapter. Chapter 7 offers two appendices. The first one is a « bonus »: it is the MATLAB script for the channel-based « beds » presented earlier, which I programmed with the help of Thomas Pairon. It is meant to show what the Saving Private Ryan 5.1 audio files (burned on the attached CD) went through.
Chapter 8 presents the references I used to write this paper. You will find the references for the books, the websites and the cited works.
Thank you for your attention!
7 APPENDIX
7.1. APPENDIX A: CHANNEL B ASED MATLAB SCRIPT
clear all;
close all;
%OPEN HRTF FILES:
LHRTFLeft = fopen('L0e030a.dat','r','ieee-be'); % L CHANNEL
LDataLeft = fread(LHRTFLeft,256,'short');
fclose(LHRTFLeft);
LHRTFRight = fopen('L0e210a.dat','r','ieee-be');
LDataRight = fread(LHRTFRight,256,'short');
fclose(LHRTFRight);
CHRTFLeft = fopen('L0e000a.dat','r','ieee-be'); % C CHANNEL
CDataLeft = fread(CHRTFLeft,256,'short');
fclose(CHRTFLeft);
CHRTFRight = fopen('L0e180a.dat','r','ieee-be');
CDataRight = fread(CHRTFRight,256,'short');
fclose(CHRTFRight);
RHRTFLeft = fopen('L0e330a.dat','r','ieee-be'); % R CHANNEL
RDataLeft = fread(RHRTFLeft,256,'short');
fclose(RHRTFLeft);
RHRTFRight = fopen('L0e150a.dat','r','ieee-be');
RDataRight = fread(RHRTFRight,256,'short');
fclose(RHRTFRight);
LsHRTFLeft = fopen('L0e250a.dat','r','ieee-be'); % Ls CHANNEL
LsDataLeft = fread(LsHRTFLeft,256,'short');
fclose(LsHRTFLeft);
LsHRTFRight = fopen('L0e070a.dat','r','ieee-be');
LsDataRight = fread(LsHRTFRight,256,'short');
fclose(LsHRTFRight);
RsHRTFLeft = fopen('L0e110a.dat','r','ieee-be'); % Rs CHANNEL
RsDataLeft = fread(RsHRTFLeft,256,'short');
fclose(RsHRTFLeft);
RsHRTFRight = fopen('L0e290a.dat','r','ieee-be');
RsDataRight = fread(RsHRTFRight,256,'short');
fclose(RsHRTFRight);
% .WAV FILES IMPORT
LLeft = wavread('SPR-L.wav');
LRight = wavread('SPR-L.wav');
CLeft = wavread('SPR-C.wav');
CRight = wavread('SPR-C.wav');
RLeft = wavread('SPR-R.wav');
RRight = wavread('SPR-R.wav');
LsLeft = wavread('SPR-Ls.wav');
LsRight = wavread('SPR-Ls.wav');
RsLeft = wavread('SPR-Rs.wav');
RsRight = wavread('SPR-Rs.wav');
LFE = wavread('SPR-LFE.wav');
% CONVOLUTIONS
% (each speaker feed is convolved with the HRIR sample vectors read above)
LConvHRTFLeft = conv(LLeft,LDataLeft); % L CHANNEL
LConvHRTFRight = conv(LRight,LDataRight);

CConvHRTFLeft = conv(CLeft,CDataLeft); % C CHANNEL
CConvHRTFRight = conv(CRight,CDataRight);

RConvHRTFLeft = conv(RLeft,RDataLeft); % R CHANNEL
RConvHRTFRight = conv(RRight,RDataRight);

LsConvHRTFLeft = conv(LsLeft,LsDataLeft); % Ls CHANNEL
LsConvHRTFRight = conv(LsRight,LsDataRight);

RsConvHRTFLeft = conv(RsLeft,RsDataLeft); % Rs CHANNEL
RsConvHRTFRight = conv(RsRight,RsDataRight);
% PADDING:
TotalLength = [length(LConvHRTFLeft) length(LConvHRTFRight) ...
    length(CConvHRTFLeft) length(CConvHRTFRight) length(RConvHRTFLeft) ...
    length(RConvHRTFRight) length(LsConvHRTFLeft) length(LsConvHRTFRight) ...
    length(RsConvHRTFLeft) length(RsConvHRTFRight) length(LFE)];
L = zeros(max(TotalLength),2);
for i = 1:length(LConvHRTFLeft)
    L(i,1) = LConvHRTFLeft(i);
    L(i,2) = LConvHRTFRight(i);
end

C = zeros(max(TotalLength),2);
for i = 1:length(CConvHRTFLeft)
    C(i,1) = CConvHRTFLeft(i);
    C(i,2) = CConvHRTFRight(i);
end
R = zeros(max(TotalLength),2);
for i = 1:length(RConvHRTFLeft)
    R(i,1) = RConvHRTFLeft(i);
    R(i,2) = RConvHRTFRight(i);
end

Ls = zeros(max(TotalLength),2);
for i = 1:length(LsConvHRTFLeft)
    Ls(i,1) = LsConvHRTFLeft(i);
    Ls(i,2) = LsConvHRTFRight(i);
end

Rs = zeros(max(TotalLength),2);
for i = 1:length(RsConvHRTFLeft)
    Rs(i,1) = RsConvHRTFLeft(i);
    Rs(i,2) = RsConvHRTFRight(i);
end
% Pad the (unconvolved) LFE channel to the common length
LFEPadded = zeros(max(TotalLength),1);
for i = 1:length(LFE)
    LFEPadded(i) = LFE(i);
end
% ASSEMBLY INTO BUCKETS (the non-directional LFE feeds both ears equally)
Bucket = [L(:,1) + C(:,1) + R(:,1) + Ls(:,1) + Rs(:,1) + LFEPadded, ...
          L(:,2) + C(:,2) + R(:,2) + Ls(:,2) + Rs(:,2) + LFEPadded];
% BUCKET NORMALISATION (scalar peak taken over both columns):
BucketNorma = Bucket/max(max(abs(Bucket)))/2;
%EXPORT
wavwrite(BucketNorma,44100,'SPR-5_1_matlab.wav');
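As a quick sanity check, and purely as a hypothetical usage example rather than part of the original script, the rendered file can be reloaded and auditioned over headphones directly from the MATLAB prompt:

% Reload the binaural render and play it back (use headphones):
[y, fs] = wavread('SPR-5_1_matlab.wav');
sound(y, fs);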
8 REFERENCES
8.1. BIBLIOGRAPHY
• Pickles, James O. An Introduction to the Physiology of Hearing. 3rd ed. Leiden:
Brill, 2013. Print.
• Botte, M.C. Psychoacoustique Et Perception Auditive. N.p.: INSERM/SFA/CNET,
n.d. Print. Série Audition.
• Ward, Jamie. The Student's Guide to Cognitive Neuroscience. Hove, England: Psychology Press, 2006. Print.
• Popper, Arthur N., and Richard R. Fay. Sound Source Localization. New York:
Springer, 2005. Print.
• Smith, Steven W. The Scientist and Engineer's Guide to Digital Signal Processing. San Diego, CA: California Technical Pub., 1997. Print.
• Wang, DeLiang, and Guy J. Brown. "Chapter 5: Binaural Sound Localization."
Computational Auditory Scene Analysis: Principles, Algorithms, and Applications.
Hoboken, NJ: Wiley Interscience, 2006. N. pag. Print.
• J. Blauert, “Spatial Hearing: The Psychophysics of Human Sound Localization”,
MIT Press, Cambridge, MA, 1997
• G. von Békésy, Experiments in Hearing (Translated and edited by E. G.
Wever), McGraw-Hill, New York, 1960
8.2. WEBSITES
• Tewfik, Ted L., M.D. "Auditory System Anatomy." Auditory System Anatomy. Medscape, n.d. Web. Aug. 2013.
8.3. CITED WORKS
Mershon, D. H., & King, L. E. (1975). Intensity and reverberation as factors in the
auditory perception of egocentric distance. Perception & Psychophysics, 18(6), 409-
415. doi: 10.3758/BF03204113
A. D. Blumlein. U.K. Patent 394,325, 1931. Reprinted in Stereophonic Techniques, Audio Eng. Soc., NY, 1986
A. Kulkarni et al., "On the Minimum-Phase Approximation of Head-Related Transfer Functions," in Proc. 1995 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE catalog no. 95TH8144).
ARI HRTF Database detailed description, retrieved on 07/17/13 from http://www.kfs.oeaw.ac.at
Beckius GE, Batra R & Oliver DL (1999). Axons from anteroventral cochlear nucleus that terminate in medial superior olive of cat: observations related to delay lines. J Neurosci 19, 3146–3161
Butler, R. A., & Humanski, R. A. (1992). Localization of sound in the vertical plane with and without high-frequency spectral cues. Perception & Psychophysics, 51(2), 182-186.
Chen, C. I., & Wakefield, G. H. (2001). Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space. J Audio Eng Soc, 49(4), 231-249.
Goupell, M. G., Majdak, P., & Laback, B. (2009). Median-plane sound localization as
a function of the number of spectral channels using a channel vocoder. J. Acoust. Soc.
Am., 127 (2), 990-1001. doi: 10.1121/1.3283014
Grothe, B., Pecka, M., & McAlpine, D. (2010). Mechanisms of sound localization in mammals. Physiol Rev, 90, 983-1012. doi: 10.1152/physrev.00026.2009
Jeffress, L. (1948). A place theory of sound localization. Journal of Comparative and
Physiological Psychology, 41(1), 35-39.
Johnstone, B., & Sellick, P. (1972). The peripheral auditory apparatus. Quarterly
reviews of biophysics, 5(01), 1-57.
L. Rayleigh, "On Our Perception of Sound Direction," Philosoph. Mag., vol. 13, 1907
Langner, B., Black, A. (2005) Using Speech In Noise to Improve Understandability for
Elderly Listeners, ASRU 2005, San Juan, Puerto Rico
Liberman MC, Dodds LW, Pierce S (1990) Afferent and efferent innervation of the cat cochlea: quantitative analysis with light and electron microscopy. J Comp Neurol 301:443–460.
Little, A., Mershon, D., & Cox, P. (1992). Spectral content as a cue to perceived
auditory distance. Perception, 21(3), 405-416.
McAlpine, D. (2005). Creating a sense of auditory space. J. Physiol., 556 (1), 21-28.
Naguib, M., & Wiley, R. H. (2001). Estimating the distance to a source of sound:
mechanisms and adaptations for long-range communication. Animal Behaviour , 62,
825-837. doi: 10.1006/anbe.2001.1860
Nam, J., Kolar, M. A., & Abel, J. S. (2008). On the minimum-phase nature of head-
related transfer functions. Audio Engineering Society 125th Convention paper .
P. M. Hofman and A. J. Van Opstal, "Spectro-Temporal Factors in Two-Dimensional Human Sound Localization", in Journal of the Acoustical Society of America, vol. 103, 2634-2648 (1998).
Pecka M, Zahn TP, Saunier-Rebori B, Siveke I, Felmy F, Wiegrebe L, Klug A, Pollak GD, Grothe B (2007) Inhibiting the inhibition: a neuronal network for sound localization in reverberant environments. J Neurosci 27:1782–1790
Searle. (1975). The contribution of two ears to the perception of vertical angle in
sagittal planes. J. Acoust. Soc. Am., 109(1596), 8 pages.
Stevens, S., & Newman, E. (1936). The localization of actual sources of sound. The American Journal of Psychology, 48(2), 297-306.
T. R. Letowski and S. T. Letowski, "Auditory Spatial Perception: Auditory Localization", Army Research Laboratory (ARL), April 2012
Wallach, H. (1940). The role of head movement and vestibular and visual cues in
sound localization. Journal of Experimental Psychology, 27 (4), 339-368.
Wever, E. G., & Vernon, G. A. (1955). The threshold sensitivity of the tympanic
muscle reflexes. Arch. Otolaryngol , 62, 204-213.
Wiener, F. M., & Ross, D. A. (1946). The pressure distribution in the auditory canal in
a progressive sound field. J. Acoust. Soc. Am., 18(2), 401-408.
Wightman, F. L., & Kistler, D. J. (1997). Monaural sound localization revisited. Journal of the Acoustical Society of America, 101, 1050-1063.
Wightman, F. L., & Kistler, D. J. (1999). Resolution of front-back ambiguity in spatial
hearing by listener and source movement. J. Acoust. Soc. Am., 105(5), 2841-2853.
Yan-Chen, L., & Cooke, M. (2010). Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources. Audio, Speech, and Language Processing, IEEE Transactions on, 18(7), 1793-1805. doi: 10.1109/TASL.2010.2050687
Zahorik, P. (2001). Estimating sound source distance with and without vision.
Optometry and Vision Science, 78(5), 270-275. doi: 1040-5488/01/7805-0270/0