Enhanced Di use Field Model for Ad Hoc Microphone Array ...publications.idiap.ch/downloads/papers/2014/Taghizadeh...ad hoc microphone array. However, at the core of steered high quality

Enhanced Diffuse Field Model for Ad Hoc Microphone ArrayCalibration

Mohammad J. Taghizadeha,b, Philip N. Garnera, Hervé Bourlarda,b

Emails: {mohammad.taghizadeh, phil.garner, herve.bourlard}@idiap.ch

aIdiap Research Institute, Martigny, SwitzerlandbÉcole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Abstract

In this paper, we investigate the diffuse field coherence model for microphone array pairwise

distance estimation. We study the fundamental constraints and assumptions underlying this ap-

proach and propose evaluation methodologies to measure the adequacy of diffuseness for mi-

crophone array calibration. In addition, an enhanced scheme based on coherence averaging and

histogramming, is presented to improve the robustness and performance of the pairwise distance

estimation approach. The proposed theories and algorithms are evaluated on simulated and real

data recordings for calibration of microphone array geometry in an ad hoc set-up.

Keywords: Ad hoc microphone array calibration, Diffuse field coherence model, Adequacy of

diffuseness

1. Introduction

Microphone arrays are widely used in meeting rooms and teleconferencing applications.

They are specifically employed to improve the speech quality by steering the beampattern to-

wards a desired speaker [1, 2]. A plethora of applications includes distant speech recogni-

tion [3, 4], speaker localization [5, 6] and speech separation [7]. Recent advances in mobile

computing and communication technologies enable use of cell phones, PDAs or tablets as an

ad hoc microphone array. However, at the core of steered high quality acquisition, traditional

localization and beamforming techniques are impractical without sufficient prior information on

the microphone array geometry. Hence, in order to enable the effective use of ad hoc micro-

phones for sound applications, calibration of the microphone array is required.Preprint accepted at Signal Processing June 3, 2014

State of the art calibration techniques can be grouped into three categories. The first approach

relies on transmitting a known signal to perform microphone calibration. Sachar et al. [8] pre-

sented an experimental setup using a pulsed acoustic excitation generated by five domed tweeters.

The transmit times between speakers and microphones were used to find the relative geometry.

Raykar et al. [9] used a maximum length sequence or chirp signal in a distributed computing

platform. The time difference of arrival of the microphone signals were then computed by cross-

correlation and used for estimating the microphone locations. Since the original signal is known,

these techniques are robust to noise and reverberation.

The second category enables using an unknown signal; the microphone calibration is usu-

ally integrated with source localization. Flanagan and Bell [10] proposed a method using the

Weiss-Friedlander technique, where the sensor location and direction of arrival of the sources

are estimated alternately until the algorithm converges. Another approach was proposed by Chen

et al. [11] where they introduced an energy-based method for joint microphone calibration and

speaker localization. The energy of the signal is computed and a nonlinear optimization problem

is formulated to perform maximum likelihood estimation of the source-sensor positions. This

method requires several active sources for accurate localization and calibration.

McCowan et al. [12] proposed a calibration method based on the characteristics of a diffuse

sound field model. A diffuse field can be roughly described as an acoustic field where the sig-

nals propagate with equal probability in all directions with the same power. The diffuse field is

verified for meeting rooms and car environments [13] and it enables application of well-defined

mathematical models for analysis of the acoustic field recordings. A particular property related

to diffuse field recordings is the coherence function between pairwise microphone signals which

is defined by a sinc function of the distance between the two microphones. Thereby, we can

estimate the pairwise distances by least-squares fitting the computed coherence with the sinc

function. This procedure is accomplished for each frame independently. To increase the robust-

ness, the frame-based estimates are combined using k-means clustering [12]. This approach is

applicable in a general room without the need for any explicit initialization or activating calibra-

tion signals. The study presented in this paper is built on the idea of incorporating the properties

2

of a diffuse field for ad hoc microphone array calibration.

The diffuse field has been studied rather extensively by many researchers with the aim of de-

veloping practical strategies for determining sound power, absorption measurements, and trans-

mission loss. However, very few studies consider applicability of the associated models for

microphone calibration. The purpose of this paper is to investigate the fundamental hypotheses

of the diffuse field model and to elucidate the limitations and the scope of its applicability. The

study of sound fields in lightly damped enclosed spaces can be approached in two different ways.

One is based on solving the wave equation with known boundary conditions, which leads to de-

scriptions in terms of the modes of the room. The other approach relies on statistical models for

analysis of the field and requires far less information about the room geometry. We apply both of

these methods to highlight the requirements for application of the diffuse field model to enable

microphone array calibration.

The paper is organized as follows: The fundamentals of diffuse fields are studied in Section 2.

We overview the characteristics and models of the diffuse field and the measurement for diffuse-

ness. The methods to enhance the diffuse sound model are proposed in Section 3 and applied in

the framework of microphone array calibration in Section 4. The fundamental limitation of the

diffuse model are explained in Section 5. The experimental analyses are presented in Section 6

and the conclusions are drawn in Section 7.

2. Diffuse Field Fundamentals

2.1. Definition of Diffuse Field

A diffuse field is defined as an acoustic field consisting of a superposition of an infinite

number of sound waves traveling with random phases and amplitudes such that the energy density

is equivalent at all points. More precisely, all points in the field radiate equal power and random

phase sound waves, with the same probability for all directions, and the field is homogeneous

and isotropic [14]. A diffuse field can be realized if a point source is active in a highly echoic

room. By removing the direct sound and the initial reflections from a recording of the sound,

the remaining part consists of diffuse reflections. In addition, ambient distributed sound sources3

yield a diffuse field, while the interference phenomena near the room boundaries and corners

raise the energy level and reduce the diffuseness. In a free space, having many uncorrelated

sources distributed at long distances can generate a diffuse field.

The diffuse sound field at its theoretical level does not exist in practice. However, in many

cases, a diffuse sound field can be a useful approximation of the real acoustic field in an enclosure.

The important point is then to measure the amount of diffuseness and evaluate its adequacy for

different applications. The analytic studies consider two points of view: (1) the wave equation

based approach that describes diffuse field through the modes in a room and (2) the statistical

approach by considering an infinite number of free propagation plane waves, referred to as the

plane wave model.

2.2. Diffuse Field Model

2.2.1. Mode Model

This theory analyzes a room as a pack of resonators with bandwidth proportional to the

absorption of the walls [15, 16]. The 3 dB bandwidth of the mode is given by

B3dB =1

2πτ, (1)

where τ corresponds to the decaying time constant of the sound field energy [17].

By solving the equations of a homogeneous sound field with boundary conditions, we extract

normal modes for the room. Each mode indicates a resonance frequency, and the distribution

of these frequencies is determined by the shape and dimension of the room [18, 19]. At high

frequencies f , the mode density depends solely on the room volume V as expressed through

γ( f ) =4πVc3

f 2, (2)

where c denotes the speed of sound. The modal overlap is defined as the average number of

4

modes excited by a pure tone, and it is given by

η( f ) = γ( f )B3dB =4πVc3

f 21

2πτ(3)

If the pure tone is close to the frequency of the mode, within a bandwidth of 2.2/T60, the

adjacent mode is excited; T60 is equal to the time required for the level of a steady sound to

decay by 60 dB after the sound has stopped. If η( f ) ≥ 31, there are enough excited modes to

generate a diffuse field in the room [20], hence the critical frequency to achieve diffuseness is

obtained as

fs =

√3c3τ2V

(4)

This frequency is known as the Schroeder frequency [21, 22].

2.2.2. Plane Wave Model

An alternative analysis approach, which does not need acoustic information, relies on a sta-

tistical model. In the plane wave model or the statistical model, a diffuse field is defined by the

superposition of a large set of plane waves impinging from all directions. We consider the steady

state sound field generated by a pure tone source in a reverberant room. The time domain sound

pressure P(t) at a point far from the walls and the source is expressed as

P(t) = limq→∞

q−1/2q∑

i=1

bi cos(ωt + ϕi), (5)

where bi and ϕi are random variables and independent of each other; ϕi has a uniform distribution

in [0, 2π] and bi has a normal distribution; ω denotes the angular frequency and q is the number of

plane waves. Each point in the field receives sound pressure from all directions [21]. Considering

1Deriving the 3D modes in a rectangular room, a decomposition of an oblique mode into eight plane waves canbe obtained. Hence, for Υ model overlap, we get 8Υ plane waves. Some heuristics indicate that 24 plane waves is alower bound for generating diffuse sound, therefore Υ = 3 is the smallest value to achieve diffuseness as considered inSchroeder frequency (4).

5

this spatial uniformity, we can compute an average sound pressure through

P(t) = limq,m→∞

(qm)−1/2m∑

j=1

q∑i=1

bi j cos(ωt + ϕi j), (6)

where m is the number of different directions from which plane waves impinge on a point in

the field. In three dimensions, the distribution of the plane waves is such that there is at least

one plane wave at each 4π/m steradian. The plane wave model is particularly useful at medium

to high frequencies; it requires no details about the room geometry. The accuracy however,

degrades at low frequencies and the effects of interference is ignored. Waterhouse [23] extended

this approach by considering the interference phenomena that occur near the walls. The studies

in this paper rely on the basic mode model and the plane wave model.

3. Enhanced Diffuse Field Model

3.1. Averaging the Coherence Function

3.1.1. Cross Correlation

The correlation function of the sound pressures at two points in an acoustic field is defined as

C =

∫ T0 P1(t)P2(t)dt√∫ T

0 P21(t)dt

∫ T0 P

22(t)dt

. (7)

The cross correlation function in a diffuse field has a closed form analytic solution [24, 25].

Suppose a plane wave passes two points located on the z-axis with separation d, the correlation

function would be cos(κd cosφ) where κ is the wavenumber and φ is the polar angle defined as

the angle between the wave front and the line connecting the two points [26]. The value of C for

a diffuse field can be obtained by averaging the cross correlation for all directions, as

C =∫ π

0

∫ 2π0

cos(κd cosφ) sinφ dθ dφ/4π

= sin(κd)/(κd),

(8)

where θ is the azimuth angle.6

3.1.2. Coherence Averaging

We consider a scenario in which n microphones record a diffuse field pressure signal. Sup-

pose that S i and S l represent the spectral representation of the signals in Fourier domain at

microphones i and l respectively. The cross spectral density is

Φil(ω) = S i(ω)S ∗l (ω), (9)

where “∗” is the conjugate transpose operator. The coherence of two signals is the cross spectrum

normalized by the square roots of the auto spectra, defined concisely as

Γil(ω) =Φil(ω)√

Φii(ω)Φll(ω). (10)

In a perfect diffuse field, at each frequency component, the coherence is a sinc function, which

holds if long time averaging (7) is taken [27]. As the frequency analysis is conducted on short

frames, we propose to collect several frames and take an average over the frame-based coherence

to achieve an estimate conforming to the sinc model. Therefore, we define an average coherence

function as

Γ̃il(ω) =1J

J∑j=1

<(Γ

jil(ω)

)= sinc

(ωdil

c

), (11)

where the operator

is generated by the ambient sources such as running devices, computers, etc., the amplitude of

the source signal is very weak. Therefore the prohibitive cost of air absorption affects the energy

distribution. This condition tends to violate the necessary assumption of negligible energy loss

during a mean free propagation. Hence, we propose to provide additional sources in a particular

set up to boost the sound field power.

The diffuse field is better realized for high frequencies, as more modes are excited leading to

an increase in the number of plane waves (Table 1). However the air absorption also increases

with frequency; the acoustic intensity2 of a plane wave as a function of the propagation distance

r is expressed as

I(r, ω) = I0(ω)e−r/ξ(ω), (12)

where I(r, ω) is the intensity r meters from the source, I0(ω) is the original intensity of the source

with frequencyω and 1/ξ(ω) is the attenuation factor, which increases with frequency. Therefore,

if the source has a very low power, the high frequencies can diminish and the low frequencies,

which do not excite enough resonance modes, remain in the sound field. This phenomenon

reduces the diffuseness. Hence, we speculate that increasing the energy of the sound field yields

higher diffuseness, and enables more accurate distance estimation. This idea has been evaluated

empirically through the experiments conducted in Section 6.3.

3.3. Diffuseness Evaluation

3.3.1. Broadband Power Pattern

We consider a well-designed symmetric and regular spherical array of n microphones. The

spectral representation of the signals recorded by microphone array in Fourier domain is denoted

by S(ω) = [S 1(ω), S 2(ω), . . . , S n(ω)]T . Suppose that the beamformer weights steered towards

direction a(θ, φ) is represented byF (ω, a(θ, φ)) = [F1(ω, a(θ, φ)), F2(ω, a(θ, φ)), . . . , Fn(ω, a(θ, φ))],

the response of the array by applying the beamformer would be

Y(ω, a(θ, φ)) = F (ω, a(θ, φ))S(ω). (13)

2Sound power per unit area.8

Given Y , the directional power can be measured as Y2(ω, a(θ, φ)). The directional power can be

used to evaluate the level of diffuseness. As stated in Section 2, the power in a diffuse field is

isotropic, which indicates equal power accumulated from all directions.

In a broadband diffuse field, we can apply a filter to improve the model fitting by restricting

the broadband processing to frequencies conforming to the theoretical diffuseness bounds. The

enhanced model can then be evaluated in terms of the isotropic power distribution using a broad-

band beamformer. Given Y , the beamformer output for the spectrum of signal, the broadband

beampattern is given by

B(a(θ, φ)) = Λ

√∫Ω

Y2(ω, a(θ, φ))dω, (14)

where Ω = [ωmin, ωmax] is the frequency band of the signal and Λ is a normalization factor given

by

Λ−1 = maxθ, φ

√∫Ω

Y2(ω, a(θ, φ))dω . (15)

We can see that the broadband pattern can be interpreted as a weighted average of the beam-

former’s output over the broadband spectrum [28]. Accordingly, the broadband power-pattern

would be

P(a(θ, φ)) = |B(a(θ, φ))|2. (16)

3.3.2. Diffuseness Evaluation Measure

The appropriate application-specific criterion is necessary to evaluate the adequacy of the

diffuseness. In this section, we propose a novel approach for evaluating the diffuseness in the

room to assess the diffuseness adequacy for estimating pairwise distances. For the particular

application of microphone calibration, a pointwise diffuseness is important, which indicates that

the angular distribution of the power at any point is equal in all directions.

To measure the signal power, we propose to use a superdirective beamformer by steering

the beam toward several representative directions of the space. In real scenarios, the ambient

sound source in the environment does not have the same power at all frequencies, so it is crucial

to consider the broadband power-pattern as explained in Section 3.3.1. After normalization,

9

we have to compare the three-dimensional (3D) pattern to a sphere of radius one. To obtain the

broadband power-pattern at a particular point A in space, the microphone array has to be centered

at A. Hence, the diffuseness level is defined as

XA =3

4π

∫∫∫PA(θ,φ)

ρ2 sinφ dρ dφ dθ, (17)

where PA(θ, φ) is the measure of the power stated in (16) that is received from a direction with

azimuth θ and polar angle φ in the diffuse field at point A; ρ denotes the radial distance in the

Spherical coordinate system. XA equals 1 if we have a complete diffuse field at point A.

Computation of PA(θ, φ) is not easy and we need a 3D microphone array with a carefully

designed symmetric and regular geometry. To simplify this computation, we consider reducing

the 3D pattern to 2D by averaging over all angles φ. By defining QA(θ) as a 2D approximation

of the 3D pattern PA(θ, φ) through

QA(θ) =∫ π

0

∫ PA(θ,φ)0

ρ dρ dφ, (18)

and

ν = maxθ

QA(θ), (19)

we derive X̃A as an approximation of XA through

X̃A =1πν2

∫ 2π0

∫ QA(θ)0

r dr dθ. (20)

The approximated quantity X̃A is more practical, and it has enough accuracy for our application

as we investigate numerically in Sections 6.3 and 6.4. The conventional methods consider mere

sphericity and roundness to measure the level of diffuseness [29] whereas the proposed method is

capable of directly measuring the isotropic sound field power at any given point in space; hence,

the proposed diffuseness measure yields more accurate results.

10

4. Ad Hoc Microphone Array Calibration

4.1. Conventional Method

The following objective measure has been used to fit a sinc function for a broadband spectrum

of coherence function and estimate the pairwise distance [12]

δjil(d) =

∫ ωmaxωmin

∣∣∣∣∣∣

between two microphones i and l. The averaging method is expressed as

δil(d) =∫ ωmaxωmin

∣∣∣∣∣∣∣∣1J

J∑j=1

<(Γ

jil(ω)

) − sinc(ωdc)∣∣∣∣∣∣∣∣

2

dω ,

d̃il = arg mind

δil(d)

(23)

4.3. Outlier Detection Techniques

In practice, there is no complete diffuseness and the characteristic of the sound field changes

due to irregularities and acoustic ambiguities. This phenomenon results in outlier observations

in the coherence function which lead to a high error in pairwise distance estimation. Hence, we

propose to apply an outlier detection technique after averaging the coherence of multiple frames.

The goal of outlier detection is to increase the quality and robustness of a data analysis approach.

We consider statistical outlier detection techniques based on k-means (parametric-based) as

well as histogram (non-parametric) methods. In the parametric approach, we consider a pro-

file and unsupervised learning with certain criteria to identify the outliers in pairwise distance

estimation. More experiments show that the erroneous estimates do not conform to a specific

parametric model. Hence, we resort to a non-parametric histogram-based approach. In the his-

togramming method, the outliers are identified through a fixed threshold. In addition, this method

requires less memory and computational cost, although finding the optimal size of the bins for

a large number of attributes is a challenging task. The experimental analyses conducted in sec-

tions 6.2 and 6.4 confirm the validity of the averaging method followed by outlier removal using

histogram-based clustering for robust estimation of the pairwise distance. Furthermore, we show

that histogram clustering outperforms the k-means clustering approach. At the final step in cali-

bration of the microphones, the geometry is extracted using the s-stress method [30].

5. Fundamental Limitation of Diffuse Model

This section explains the fundamental limitations and the performance bound of distance

estimation using a diffuse field coherence model. As we have already seen earlier in the paper, the

12

spatial coherence of two signals in a diffuse field is a sinc function of the pairwise distance (11).

This function decreases quickly and, as shown in Figure 1, it disappears after one cycle. Hence,

the coherence measured in the first cycle is vital in estimation accuracy.

We consider three scenarios, being a medium size room (8 × 5.5 × 3.5 m3), a large size room

(24×16.5×10.5 m3) and a very large size room (48×33×21 m3). The second zero crossings on

the sinc function as expressed in (11) occur at 343 Hz, 114 Hz and 57 Hz for pairwise distances

of 1 m, 3 m and 6 m, respectively. Hence, diffuseness at frequencies lower than these frequencies

are important.

On the other hand, the Schroeder frequency is obtained as fs =√

6c2/αZ where α is the

average absorption coefficient of the walls with a surface area of Z [22]; therefore, for an average

absorption coefficient α = 0.07 and c = 343 m/s, the Schroeder frequencies for these three rooms

are 235 Hz, 78 Hz and 39 Hz respectively. As indicated in Section 2.2.1, a diffuse field cannot be

generated in a room with a monochromatic source under the Schroeder frequency.

The mode model can be used for computing the acoustic pressure in modal behavior. Dif-

fuseness at each frequency band can be illustrated by expansion modes. Table 1 summarizes

the number of modes for each one-third-octave band in three room sizes. Based on theory, we

hypothesize that increasing the dimension of the room increases the diffuseness, in particular at

low frequencies which are highly effective in distance estimation. In addition, by increasing the

pairwise distances, the number of discrete frequencies below the second zero crossing decreases

linearly so we speculate that a linear regression can illustrate the relationship between the errors

and distances. The empirical evaluations carried out in Sections 6.3–6.6 confirm the validity of

these hypotheses. These experiments enable formulating a relation between room dimension and

achievable distance estimation.

6. Experimental Analysis

This section presents the numerical results to evaluate the proposed theories and hypotheses.

The microphone calibration performance measure must be robust to rigid transformations (trans-

lation, rotation and reflection). Hence, we use the distance between the actual locations X and

13

estimated locations X̂ as defined in [31]

dist(X, X̂) =1n

∥∥∥LXXT L − LX̂X̂T L∥∥∥F ,L = In − (1/n)1n1Tn ,

(24)

where ‖·‖F denotes the matrix Frobenius norm. The 1n ∈ Rn is the all ones vector, In is the n × n

identity matrix and X, X̂ ∈ Rn×η, where η is the dimension of the space. The distance measure

stated in (24) is useful to compare the performance of different methods when the microphone

array geometry is fixed.

6.1. Data Recording Set-up

6.1.1. Simulation Scenarios

We simulate a medium size room of dimensions 8× 5.5× 3.5 m3, which has the same dimen-

sion of the room in the real scenario. The room is equipped with 48 omni-directional loudspeak-

ers playing independent white Gaussian noise. These are divided into 3 uniform circular arrays

with diameters of 1.5 m, 2.5 m and 1.5 m, producing the sound field. The three circular loud-

speaker arrays are parallel to the floor and located at the center of the planar area of the room at

0.1 m, 1.75 m and 3.4 m height. A uniform 8-channel circular microphone array located at center

of the room is used to record the sound field. The diameter of the array is adjusted such that the

pairwise distance between the microphones is equal to {0.1, 0.2, 0.3, . . . , 0.8}m; that corresponds

to the microphone array diameters of {0.26, 0.52, 0.78, 1.04, 1.30, 1.57, 1.83, 2.10}m. To enable

evaluations for larger distances beyond 0.8 m, the 8-channel array is replaced with a 16-channel

uniform circular microphone array with a diameter equal to {0.9, 1, . . . , 2}m. Figure 2 depicts a

top view of the simulated scenario.

In addition, for investigation of the effect of room dimension on diffuseness of the field,

and distance estimation, a large room as well as a very large room of dimensions 24 × 16.5 ×

10.5 m3 and 48 × 33 × 21 m3 such that each dimension is 3 and 6 times bigger than real scenario

are simulated. The same set-up of loudspeakers are used where the diameters are expanded by

a factor of 3 and 6 . The same microphone array is used to record the sound field. The diameter

14

of the array is adjusted such that the pairwise distance between the microphones are varied from

0.1m to 10 m.

The room impulse responses are generated with the image source model [32] using intra-

sample interpolation up to 15th order reflections. The corresponding reflection ratio, β used by

the image model was calculated via Eyring’s formula:

β = exp(−13.82/[c × (L−1x + L−1y + L−1z ) × T60]), (25)

where Lx, Ly and Lz are the room dimensions. The temperature of the room is assumed to be 20◦

Celsius, thus c = 343 m/s. In our experiments, T60 = 300 ms for the medium size room. The

direct-path propagation is discarded from the impulse response for generating a diffuse sound

field [5].

6.1.2. Real Data Scenario

In addition to the simulated recordings, we use the geometrical setup of the MONC corpus to

record the sound field in a meeting room [33]. The enclosure is a 8×5.5×3.5 m3 rectangular room

and it is moderately reverberant. It contains a centrally located 4.8 × 1.2 m2 rectangular table.

Twelve microphones are located on a planar area parallel to the floor at height of 1.15 m: Eight of

them are located on a circle with diameter 20cm and one microphone is at the origin. There are

three additional microphones at a 70cm distance from the central microphone. The microphones

are Sennheiser MKE-2-5-C omnidirectional miniature lapel microphones. The floor of the room

is covered with carpet and surrounded with plaster walls and two big windows.

The recordings were made in two scenarios: (1) Collecting the diffuse sound field of ambient

noise in the room without any additional source and (2) playing extra sounds by putting two

small loudspeaker under the table, and covering them with anti-acoustic material, so that the

direct paths between loudspeaker and microphones are prohibited to ensure diffuseness. The

microphone placement is depicted in Figure 3. The sampling rate is 48 kHz while the processing

applied for microphone calibration is based on a down-sampled signal at rate 16 kHz to reduce

the computational cost. The experiments are conducted using c = 343 m/s that corresponds to

15

20◦ Celsius temperature of the room.

6.2. Averaged Coherence Function

Figure 1 shows a real data example of the coherence of one frame (top) and the coherence

function averaged over 100 frames (bottom) along with the fitted sinc function. As we can

see, averaging is crucial prior to fitting the model by least square regression. The conventional

method [12] fitted a sinc function on a single frame followed by k-means clustering of multiple

frames to determine the distance. The numerical results show that the error of fitting a sinc

function on the averaged coherence function is 35 times smaller than the conventional method for

small distances. Furthermore, this method speeds up the calibration by a factor of 60 compared

to the k-means clustering method in terms of CPU time using the same number of frames.

6.3. Diffuseness Evaluation

The first experiments consider measuring the diffuseness with the method proposed in sec-

tion 3.3.2. A superdirective beamformer was used for measuring the power of the received signal

from all directions. Figure 4 shows the patterns for the simulated very large room at distances

2 m (top) and 5 m (bottom) from the room center. We can see that a more isotropic power is

obtained if the point of measurement is closer to the room center. Based on the definition stated

in (20), the diffuseness levels at 2m and 5m distances from the center of the room are .92 and .84

respectively, which shows that the diffuseness reduces as we get closer to the borders.

The diffuseness for the real data recorded at the meeting room without additional sources

is measured as 0.70. We increased the power by playing white Gaussian noise from the two

small loudspeakers put under the table. Figure 5 shows the pattern with the proposed sound

field augmenting method compared to the initial recordings. A more isotropic sound field is

obtained as the pattern is closer to a circle. Quantitatively, the diffuseness is improved to 0.83,

that indicates a 19% increase in diffuseness level.

Based on the diffuseness level quantified in this section and the real data distance estimation

results listed in Table 2 (explained further in the next Section 6.4), we can see that a diffuseness

16

level around 0.7 is a reasonably adequate diffuseness as we can estimate the pairwise distances

with less than 5% relative error.

As discussed in Section 3.3.2, estimation of the directional power can be accomplished by

a symmetric and uniform microphone array; that implies a carefully designed spherical (3D) or

circular (2D) array. The 2D approximation reduces and simplifies some of the computations.

We consider this level of approximation reasonable as the obtained calibration error and distance

measurement are not very sensitive to the quantified diffuseness [14]. Furthermore, the numerical

results confirm that the quantified diffuseness levels are in agreement with the distance estimation

results (Table 2).

6.4. Distance Estimation Performance

In order to estimate the pairwise distances, two microphone signals are processed using a

short time Fourier transform of 64ms frames obtained by applying the Tukey window with pa-

rameter = 0.25. The total length of each microphone signal is 30s. For each frame, we compute

the coherence function through (10) and estimate the pairwise distance by fitting a sinc func-

tion as stated in (21) and (22). In the baseline approach, each frame is processed independently,

which yields 468 point estimates of pairwise distances. To obtain a single estimate of the distance

between the two microphones, clustering is applied on the point estimates. Based on k-means

clustering, the center of the cluster with the smaller error determines the pairwise distance [12].

Using our enhanced model elaborated in Section 3.1, a sinc function is fitted to the averaged

coherence function.

We conduct the evaluations using simulated data in a controlled (almost ideal) diffuse field

in the medium and large size rooms as described in Section 6.1. Figure 6 illustrates the results.

We can see in Figure 6 (top: averaging method) that in the medium size room, the pairwise

distances smaller than 1 m can be estimated with less than or equal to 0.02 m error (the 90%

confidence interval is 0.03 m). The estimates become highly erroneous beyond 1 m. By using

the conventional method (Figure 6 top: k-means clustering), observations show that this method

is only applicable when the microphones are located in close proximity to each other (i.e., the

17

pairwise distance less than 30 cm). Figure 6 (bottom) illustrates that in the large size room, the

averaging method is effective for estimation of pairwise distances up to 3 m.

The relative error for distance estimation di can be quantified as

�i =

√√√√∑Nl=1

(d̂il−di

di

)2N

(26)

where d̂il is lth estimation of distance di and N is the number of microphone pairs with pairwise

distance di. Figure 7 shows that measure of �i is almost constant for each room and we can fit a

linear regression model on the relative error. As depicted in Figure 7 (top), the line corresponds

to 0.0164 m relative error and the residual error of the linear regression is 0.0028 for the medium

size room. In addition, we performed some evaluations in the large room set-up as described

in Section 6.1. The theories of the sinc function coherence model hold for up to 3 m pairwise

distance, which is also verified through our experiments in a diffuse field. Similar to the previous

experiment, we can fit a linear regression model on the relative error as depicted in Figure 7

(bottom). The line corresponds to 0.0124 m relative error and the residual error of the linear

regression is 0.0012. We can see that the following mathematical model holds for estimation of

pairwise distance

d̂ ∼ N(d, (d�)2) (27)

where N denotes the normal distribution and � is the mean of the relative errors in distance

estimation which is equal to 0.0164 and 0.0124 in the medium and large size rooms respectively.

A smaller � indicates that diffuseness is better realized in the larger room.

We further conduct some evaluations using real data recorded in a meeting room. Figure 8

illustrates that, for microphones 7 and 8 located 7.6 cm apart, the k-means clustering estimated

distance is 8.2 cm. Figure 9 (top) demonstrates that for microphones 11 and 5 where the dis-

tance is 77.38 cm, it is not possible to provide a reliable estimate by fitting a sinc function on a

single frame and k-means clustering. The estimated distance is 66 cm which shows more than

11 cm (14%) error.

18

The proposed averaging method enables more accurate point estimates with fewer outliers.

Figure 9 (bottom) shows fusion of the k-means method with an averaging technique for estimat-

ing the distance between microphones 11 and 5. The averaging is performed on each 5 frames

with 80% overlapping. The results shows that the percentage of outliers is reduced so the esti-

mated pairwise distance is 76.6 cm, amounting to 8 mm (1%) error.

Although the averaging method reduces the number of outliers, the k-means clustering is

not stable and it can generate the wrong winner class. Figure 10 shows distance estimation for

microphones 11 and 6. The estimated distance is 90.2 cm, whereas the correct distance is 80cm.

The winner class is wrong using k-means clustering. We propose to remove the outliers using

a histogram clustering method, which also offers computational speed advantages over the k-

means algorithm. Furthermore, as discussed in Section 4.3, it is a more appropriate technique for

removing outliers compared to k-means clustering. The two-dimensional histogram clustering is

shown in Figure 11. Note that the histogram represents the difference of the positions (distance)

of the microphones and not the positions themselves. This method is not dependent on the

absolute position of the compared microphones. In the histogram method, the bin with the largest

number of estimation points is the winner used for the final estimation. The resolution of the bins

is a critical parameter for construction of the histogram. We observed empirically that a 50 × 50

histogram provides a good estimate; it corresponds to a resolution of an average 7mm in pairwise

distance estimation. The two-dimensional histogram enables estimation of the pairwise distance

as 80.3 cm which has only 3 mm error, equal to 0.4%. We can see that this method is more

accurate than k-means clustering and it is more robust to noisy estimates in real data evaluations.

Table 2 summarizes all the results for pairwise distance estimation in the real data scenario.

The first column is the ground truth distances. The second column is the root mean square error

(RMSE) for the baseline method, and the third column is the RMSE for the boosted power diffuse

field; it shows an improvement compared to the baseline. The fourth column corresponds to the

results of using the averaging and two-dimensional histogram methods, which shows noticeable

improvement. Applying this method on the boosted power diffuse field shows an additional slight

improvement as listed in the last column. We can see that the averaging and two-dimensional

19

histogram are more important to achieve robust and accurate results. Furthermore, Figure 12

illustrates the measure of improvement using each method. We can see that although boosting

the power increases the diffuseness, the improvement in pairwise distance estimation is small

because measuring the diffuseness was done on all frequency bands, whereas only the low fre-

quency part has contribution to the distance estimation. Therefore measuring the diffuseness at

low frequencies is essential to predict the performance of distance estimation.

6.5. Array Calibration Performance

In the final section, we compare all methods for calibration of the geometry of the 9 micro-

phones using real data. Figure 13 illustrates the microphone calibration results.

The geometry of the array is extracted using the state-of-the-art s-stress [31] method by

solving the following optimization problem

X̂ = arg minX

∑(i, j)∈E

(∥∥∥xi − x j∥∥∥2 − d̃2i j)2 , (28)where E ⊆ [n] × [n] denotes the subset of the estimated pairwise distances and xi represents the

microphone location i . This method is a robust and accurate localization technique where the

search space is constrained to the Euclidean geometry. The reconstruction error for the baseline

method using the criterion stated in (24) is 8.83. The estimated error based on averaging method

is 8.04.

To further improve the performance, we use the two-dimensional histogram to remove out-

liers. We can see the improved estimates using the hybrid of averaging method and outlier de-

tection, where the averaging method is applied on five frames to estimate the pairwise distances

and to construct the two-dimensional histogram; the estimated error is 5.00. Table 3 summarizes

the results. The same number of frames is used by each method.

20

6.6. Diffuseness Adequacy for Pairwise Distance Estimation

The theory stated in section 2.2.1 asserts that the critical frequency to create a diffuse field is

inversely proportional to the dimension of the room. Hence, as the room gets larger, the critical

frequency gets smaller, and we can achieve a higher diffuseness especially at low frequencies. On

the other hand, Equation (11) shows that by increasing the pairwise distance, the sinc function

squeezes in the frequency domain; therefore, the diffuseness at low frequencies becomes highly

important for fitting the coherence function and the estimated sinc function. Hence, estimation

of large pairwise distances is difficult. Table 4 illustrates the relation between room dimension

and maximum pairwise distance estimation.

As the simulation results illustrate, increasing the dimensions by a factor of 6 enables estima-

tion of larger pairwise distances by a similar factor of 6. Therefore, in the room with dimensions

48 × 33 × 21m3, pairwise distances up to 6 m can be estimated accurately. Table 5 summarizes

the results of distance estimation for the very large room.

Comparing the simulated and real data evaluations on the medium size room shows that, in a

simulated as well as real diffuse field, we can estimate pairwise distance up to 1 m (Table 4).

Section 5 showed that between the second zero crossing frequency and the Schroeder fre-

quency for the aforementioned three rooms (medium, large and very large), there are only two

bands for distances 1m, 3m and 6m respectively (Table 1). Hence, the diffuseness generated by

a tone is very weak and we may not be able to fit the sinc function to extract these pairwise

distances.

In our particular case of using broadband signal, all bands that have more than 25 modes

generate a diffuse field [20]. Our empirical evaluations show that at least 5 diffuse field bands

below the second zero crossing frequency are necessary to achieve reasonable accuracy in dis-

tance estimation. Table 1 shows that in the medium size room, 5 bands (15–19) generate an

adequate diffuse field at frequencies below the second zero crossing; similarly, in the large room

and the very large room, the bands 10–14 and 7–11 generate adequate diffuse field distances

corresponding to 1m, 3m and 6m respectively. Based on this theory and the second column of

Table 1, estimation of 3 m distances in the medium size room are impossible. The experiments on

21

real data recordings confirm this theoretical insight. Hence, it becomes straightforward to deter-

mine the maximum distance which can be estimated using the diffuse field model. The procedure

requires extraction of the modes for the room. The minimum frequency ( f ∗) to have 5 bands gen-

erating more than 25 modes lower than f ∗ yields the maximum resolvable distance as d∗ = c/ f ∗.

Hence, as f ∗ gets smaller (i.e. room gets larger) the maximum estimated distance is increased.

The f ∗ is equal to 355, 112 and 56 Hz for the medium, large, and very large rooms respec-

tively, cf. Table 1, 19, 14 and 11 band indices. Those correspond to the maximum distances of

0.97 m, 3.06 m and 6.12 m.

7. Conclusions

In this paper, we studied the diffuse field model to enable ad hoc microphone array calibra-

tion. The analyses showed the importance of averaging the coherence function prior to fitting

the sinc function. The robustness was further improved using 2D histogram based clustering for

outlier detection. The enhanced model was shown to outperform the conventional method sig-

nificantly. The fundamental limitations of this approach were elaborated and effective strategies

were proposed to enable estimation of array geometry in an arbitrary set-up. Based on the theo-

retical as well as empirical studies on adequacy of diffuseness, a mathematical relationship was

characterized to link the room dimensions to the maximum resolvable distance using a diffuse

field model. The theory explains why larger aperture arrays can be calibrated in larger enclosures

and suggests a simple procedure to figure out the maximum distance that can be estimated using

a diffuse field coherence model.

Acknowledgments

This work has received funding from the Swiss National Science Foundation under the Na-

tional Center of Competence in Research (NCCR) on “Interactive Multi-modal Information Man-

agement” (IM2). The authors acknowledge the anonymous reviewers for the precise and helpful

comments and remarks to improve the quality and clarity of the manuscript.

22

References

[1] Q. Zou, X. Zou, M. Zhang, Z. Lin, A robust speech detection algorithm in a microphone array teleconferencing

system, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, 2001.

[2] R. Cutler, Y. Rui, A. Gupta, J. Cadiz, I. Tashev, L.-w. He, A. Colburn, Z. Zhang, Z. Liu, S. Silverberg, Distributed

meetings: a meeting capture and broadcasting system, in: Proceedings of the tenth ACM international conference

on Multimedia, 2002.

[3] M. L. Seltzer, Microphone array processing for robust speech recognition, in: PhD Thesis, Carnegie Mellon Uni-

versity, 2001.

[4] A. Asaei, H. Bourlard, P. N. Garner, Sparse component analysis for speech recognition in multi-speaker environ-

ment, in: The 11th Annual Conference of the International Speech Communication Association (INTERSPEECH),

2010.

[5] M. J. Taghizadeh, P. N. Garner, H. Bourlard, H. R. Abutalebi, A. Asaei, An integrated framework for multi-channel

multi-source localization and voice activity detection, in: IEEE workshop on Hands-free Speech Communication

and Microphone Arrays, 2011.

[6] A. Asaei, M. J. Taghizadeh, M. Bahrololum, M. Ghanbari, Verified speaker localization utilizing voicing level in

split-bands, Signal Processing 89(6), 2009.

[7] A. Asaei, M. J. Taghizadeh, H. Bourlard, V. Cevher, Multi-party speech recovery exploiting structured spar-

sity models, in: The 12th Annual Conference of the International Speech Communication Association (INTER-

SPEECH), 2011.

[8] J. M. Sachar, H. F. Silverman, W. R. Patterson, Microphone position and gain calibration for a large-aperture

microphone array, IEEE Transactions on Speech and Audio Processing 13(1), 2005.

[9] V. C. Raykar, I. V. Kozintsev, R. Lienhart, Position calibration of microphones and loudspeakers in distributed

computing platforms, IEEE Transactions on Speech and Audio Processing 13(1), 2005.

[10] B. P. Flanagan, K. L. Bell, Array self-calibration with large sensor position errors, Signal Processing 81, 2001.

[11] M. Chen, Z. Liu, L. He, P. Chou, Z. Zhang, Energy-based position estimation of microphones and speakers for

ad-hoc microphone arrays, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,

2007.

[12] I. McCowan, M. Lincoln, I. Himawan, Microphone array shape calibration in diffuse noise fields, IEEE Transac-

tions on Audio,Speech and Language Processing 16(3), 2008.

[13] J. Bitzer, K. U. Simmer, K. Kammeyer, Theoretical noise reduction limits of the generalized sidelobe canceller

(GSC) for speech enhancement, in: IEEE International Conference on Acoustics, Speech, and Signal Processing,

1999.

[14] T. Schultz, Diffusion in reverberation rooms, Journal of Sound and Vibration 16(1), 1971.

[15] W. Chu, Spatial cross-correlation of reverberant sound fields, Journal of Sound and Vibration 62(2), 1979.

23

[16] W. T. Chu, Eigenmode analysis of the interference patterns in reverberant sound fields, Journal of the Acoustical

Society of America 68(1), 1980.

[17] T. Bravo, C. Maury, Enhancing low frequency sound transmission measurements using a synthesis method, Journal

of the Acoustical Society of America 122(2), 2007.

[18] M. R. Schroeder, The schroeder frequency revisited, Journal of the Acoustical Society of America 99(5), 1997.

[19] M. Wankling, B. Fazenda, Studies in modal density-its effect at low frequency, in: Proceedings of the Institute of

Acoustics, 2009.

[20] H. Nelisse, J. Nicolas, Characterization of a diffuse field in a reverberant room, Journal of the Acoustical Society

of America 101(6), 1997.

[21] M. R. Schroeder, Measurement of sound diffusion in reverberation chambers, Journal of the Acoustical Society of

America 31(11), 1959.

[22] M. R. Schroder, K. H. Kuttruff, On frequency response curves in rooms. comparison of experimental, theoretical

and monte carlo results for the average frequency spacing between maxima, Journal of the Acoustical Society of

America 76–80, vol. 34, 1962.

[23] R. V. Waterhouse, Interference patterns in reverberant sound fields, Journal of the Acoustical Society of America

27(2), 1955.

[24] C. F. Chien, W. W. Soroka, Spatial cross-correlation of acoustic pressures in steady and decaying reverberant sound

fields, Journal of Sound and Vibration 48(2), 1976.

[25] B. Rafaely, Spatial-temporal correlation of a diffuse sound field, Journal of the Acoustical Society of America 107.

[26] C. Morrow, Point-to-point correlation of sound pressures in reverberation chambers, Journal of Sound and Vibration

16(1), 1971.

[27] R. K. Cook, R. V. Waterhouse, R. D. Berendt, S. Edelman, M. C. Thompson, Measurement of correlations coeffi-

cients in reverberan sound fields, Journal of the Acoustical Society of America 27, 1955.

[28] M. J. Taghizadeh, P. N. Garner, H. Bourlard, Microphone array beampattern characterization for hands-free speech

applications, in: IEEE 7th Sensor Array and Multichannel Signal Processing Workshop, 2012.

[29] B. Gapinski, M. Grezelka, M. Ruck, The roundness deviation measurment with coordinate measuring machines,

Engineering Review 26(2), 2006.

[30] T. F. Cox, M. A. A. Cox, Multidimensional scaling, Chapman-Hall, 2001.

[31] I. Borg, P. J. F. Groenen, Modern multidimensional scaling theory and applications, Springer, 2005.

[32] J. B. Allen, D. A. Berkley, Image method for efficiently simulating small-room acoustics, Journal of Acoustical

Society of America 60(s1), 1979.

[33] The multichannel overlapping numbers corpus (MONC), Idiap resources available online:,

http://www.cslu.ogi.edu/corpora/monc.pdf.

24

0 1000 2000 3000 4000 5000 6000 7000

−1

−0.5

0

0.5

1

Frequency [Hz]

Co

he

ren

ce

0 1000 2000 3000 4000 5000 6000 7000

−1

−0.5

0

0.5

1

Frequency [Hz]

Co

he

ren

ce

Figure 1: (Top) Fitting a sinc function (red) on one frame of diffuse field coherence (blue); thecorrect distance is 20 cm and the estimated distance is 19.3 cm. (bottom) Fitting a sinc functionon average of 100 frames of diffuse sound field coherence; the estimated distance is 19.8 cm.

25

band f1 − f2 #modes/band #modes/band #modes/bandindex Hz (medium ) (large) (very large)

1 4.44-5.6 0 0 12 5.6-7.1 0 1 23 7.1-9 0 0 34 9-11.2 0 1 65 11.2-14 0 1 66 14-18 0 4 207 18-22.4 1 6 258 22.4-28 0 6 539 28-35.5 1 19 10010 35.5-45 2 29 20511 45-56 3 50 34012 56-71 7 100 73413 71-90 9 205 145814 90-112 18 340 258915 112-140 30 684 505416 140-180 68 1508 1112717 180-224 115 2589 1575418 224-280 206 5054 3913219 280-355 440 10611 82724

Table 1: Number of modes in the one-third-octave bands for medium size room (8 × 5.5 × 3.5 m3), largesize room (24 × 16.5 × 10.5 m3) and very large size room (48 × 33 × 21 m3).

26

0 2 4 6 80

1

2

3

4

5

6

LA 1 LA 2 LA 3 MA

Figure 2: Top view of the simulated medium size room scenario: This scenario consists of three circular16-element omni-directional loudspeaker arrays (LA) and one circular microphone array (MA) with thefollowing set-up parameters: LA1 has diameter=2.5 m located at height=1.75 m; LA2 and LA3 have diam-eters=1.5 located at height=0.1 m and 3.4 m respectively. A 16-element microphone array is depicted withdiameter=2 m and it is located at height=1.75 m. All arrays are parallel to the floor. The number of micro-phones and the diameter of the MA are varied as explained in Section 6.1.1 to generate various pairwisedistances.

27

Figure 3: Microphone placement for real data recording scenario.

28

0.2

0.4

0.6

0.8

1

30

210

60

240

90

270

120

300

150

330

180 0

0.2

0.4

0.6

0.8

1

30

210

60

240

90

270

120

300

150

330

180 0

Figure 4: Broadband power-pattern obtained at 2m (top) and 5m (bottom) from center of the room byaveraging over all polar angles; the scenario is synthesized in a very large room using 48 loudspeakers.

29

0.2

0.4

0.6

0.8

1

30

210

60

240

90

270

120

300

150

330

180 0

Scenario 1Scenario 2

Figure 5: Diffuseness assessment using broadband power-pattern; scenario 1: ambient source diffuse fieldand scenario 2: boosted power diffuse field by adding additional sources.

30

0 20 40 60 80 100 120 140 160 180

−5

0

5

10

15

20

25

30

35

Medium size room

Distance (cm)

Absolu

te E

rror

(cm

)

0 50 100 150 200 250 3000

1

2

3

4

Large size room

Distance (cm)

Absolu

te E

rror

(cm

)

Averaging method

k−means clustering

Figure 6: Comparison of error bars for estimation of pairwise distances in the medium (top) and large size(bottom) rooms. In the top plot, “cross” corresponds to the averaging method and “square” corresponds tothe k-means clustering. The bottom plot corresponds to the averaging method.

31

10 20 30 40 50 60 70 80 90 1000

0.05

0.1

0.15

0.2

Distance (cm)

Re

lative

err

or

Data

Linear Regression Line

0 50 100 150 200 250 3000

0.05

0.1

0.15

0.2

Distance (cm)

Rela

tive e

rror

Data

Linear Regression Line

Figure 7: Relative error vs. distance for medium size (top) and large (bottom) rooms. The linear regressioncan be used to predict the relative error.

32

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55

0.09

0.095

0.1

0.105

(8,7)

Distance (m)

Ab

so

lute

err

or

Loser class

Winner class

Means of the clusters

**

Figure 8: Baseline method: k-means clustering for microphones 7 and 8 located 7.6cm apart. Blue pointshave high errors and red points are the winners. The estimated pairwise distance is 8.21 cm.

33

0 0.2 0.4 0.6 0.8 1 1.2 1.40.74

0.76

0.78

0.8

0.82

0.84

0.86

0.88(11,5)

Distance (m)

Ab

solu

te e

rror

Looser class

Winner class

* *


0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 10.7

0.75

0.8

0.85

0.9

0.95(11,5)

Distance (m)

Ab

so

lute

err

or

Looser class

Winner class

* *


Figure 9: Distance estimation of microphones 11 and 5 using real data recordings. The ground truthis 77.38 cm. (top) Baseline method using k-means clustering on single frame coherence function. Theestimated distance is 66 cm. (bottom) k-means clustering on averaged coherence function. The estimateddistance is 76.6 cm.

34

0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 1.05 1.1 1.150.76

0.78

0.8

0.82

0.84

0.86

0.88

0.9

0.92

0.94(11,6)

Distance (m)

Absolu

te e

rror

Looser class

Winner classMeans of the clusters

**

Figure 10: Distance estimation of microphones 11 and 6 using averaging and k-means clustering; correctdistance is 80 cm and the estimated distance is 90.2 cm.

35

0.81

0.80.850.90

2

4

6

8

Distance (m)

(11,6) dist = 0.80299

Absolute error

Nu

mb

er

of po

ints

0

2

4

6

8*Winner bin

Figure 11: Distance estimation using averaging and two-dimensional histogram clustering; the correctdistance is 80 cm and the estimated distance is 80.3 cm.

36

Distance (cm) Baseline BP AVG+HIS AVG+BP+HIS Corresponding microphone pairs as depicted in Figure 37.65 .3 .26 .24 .20 {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 1)}10 .37 .35 .31 .26 {(1, 9), (2, 9), (3, 9), (4, 9), (5, 9), (6, 9), (7, 9), (8, 9)}

14.14 .38 .36 .33 .29 {(1, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8), (7, 1), (8, 2)}18.48 .44 .4 .36 .32 {(1, 4), (2, 5), (3, 6), (4, 7), (5, 8), (6, 1), (7, 2), (8, 3)}

20 .55 .47 .45 .35 {(1, 5), (2, 6), (3, 7), (4, 8)}60 8.4 6.3 3.3 2.7 {(4, 10), (2, 11), (8, 12)}70 10.4 9.6 3.5 3.0 {(9, 10), (9, 11), (9, 12)}80 14.1 13.5 3.8 3.2 {(8, 10), (6, 11), (4, 12)}99 25.2 21.3 4.3 3.6 {(10, 11), (11, 12)}

Table 2: Root mean squared error of pairwise distance estimation using diffuse field coherence modelevaluated on real data recordings. The presented techniques include the baseline formulation [12], enhancedmodel by averaging coherence function (AV), using histogram (HIS) for removing the outliers as well asboosting the power (BP) of the sound field.

37

0 10 20 30 40 50 60 70 80 90 1000

5

10

15

20

25

30

Distance (cm)

RM

SE

(cm

)

Baseline

BP

AVG+HIS

AVG+BP+HIS

Figure 12: Comparing the performance of all methods using real data recordings for pairwise distanceestimation. The baseline is k-means method. BP illustrates the results of using extra broadband sound.Furthermore, AVG+HIS shows big improvement by using averaging method and 2D histogram. FinallyAVG+HIS+BP shows the result of applying averaging, histogram and augmenting the sound field.

38

−0.1 −0.05 0 0.05 0.1 0.15

−0.1

−0.05

0

0.05

0.1

0.15

1

2

3

4

5

6

7

8

9

m

m

correct positions

averaging

averaging & clustering

clustering

Figure 13: Calibration of a 9-channel microphone array on real diffuse sound field recordings using aver-aging and a hybrid of averaging and histogram-based clustering.

39

Method ErrorBaseline 8.83

Averaging 8.04Averaging + Histogram 5.00

Table 3: Calibration results of 9 microphones.

40

Room size � Max distance (m)Medium 0.0164 1

Large 0.0124 3Very large 0.0103 6

Table 4: Maximum pairwise distance that can be estimated with relatively low error in three differentrooms based on fitting the sinc function to the average coherence function.

41

Distance (cm) RMSE400 2500 5600 10700 12800 30900 25

1000 57

Table 5: Root mean squared error of pairwise distance estimation using diffuse field coherence model fora very large size room (6 times greater than the medium size room).

42

Documents

Enhanced Di use Field Model for Ad Hoc Microphone Array ...publications.idiap.ch/downloads/papers/2014/Taghizadeh...ad hoc microphone array. However, at the core of steered high quality