
1994 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 6, NOVEMBER 2006

Feature Extraction for the Prediction of Multichannel Spatial Audio Fidelity

Sunish George, Student Member, IEEE, Slawomir Zielinski, and Francis Rumsey

Abstract—This paper seeks to present an algorithm for the prediction of frontal spatial fidelity and surround spatial fidelity of multichannel audio, which are two attributes of the subjective parameter called basic audio quality. A number of features chosen to represent spectral and spatial changes were extracted from a set of recordings and used in a regression model as independent variables for the prediction of spatial fidelities. The calibration of the model was done by ridge regression using a database of scores obtained from a series of formal listening tests. The statistically significant features based on interaural cross correlation and spectral features found from an initial model were employed to build a simplified model, and these selected features were validated. The results obtained from the validation experiment were highly correlated with the listening test scores and had a low standard error comparable to that encountered in typical listening tests. The applicability of the developed algorithm is limited to predicting the basic audio quality of low-pass filtered and down-mixed recordings (as obtained in listening tests based on a multistimulus test paradigm with reference and two anchors: a 3.5-kHz low-pass filtered signal and a mono signal).

Index Terms—Frontal spatial fidelity, ridge regression, spatial feature, spectral feature, surround spatial fidelity.

I. INTRODUCTION

THE PAST ten years have seen the extension of multichannel audio from theatres to home cinema [1]. Also, TV/radio broadcast and audio/video on-demand services in multichannel audio have become more popular. Several companies have introduced wide and varied services and products to satisfy the needs of customers. The development phase of all of these products or services normally has to pass through listening tests in order to evaluate the quality of the reproduced audio. However, such listening tests are time consuming and expensive. So-called objective methods involving physical measurement are an alternative solution in certain circumstances to overcome these difficulties. However, current objective methods have limitations such that they cannot entirely replace listening tests.

Research leading towards the objective prediction of sound quality has been undertaken since 1979 [2]. Since then, a number of advancements and novel methods have been proposed, as reported in [3]–[8] and [9]. In 1998, ITU's attempt to codify a standard for the objective evaluation of audio quality

Manuscript received February 1, 2006; revised July 17, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Peter Kroon.

The authors are with the Institute of Sound Recording, University of Surrey, Guildford, Surrey, GU2 7XH, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TASL.2006.883248

resulted in a standard (ITU-R BS.1387) [10], known as Perceptual Evaluation of Audio Quality (PEAQ) [11].

In recent years, several improvements have been proposed to PEAQ, such as [12]–[16] and [17]. However, none of them except [16] addressed issues relating to the evaluation of multichannel audio quality. In [16], Torres et al. converted 5.1-channel recordings to a binaural format using head-related transfer functions (HRTFs), and the basic version of the PEAQ model was used for quality prediction. The use of such a binaural "front end" to PEAQ does not overcome the basic problem that PEAQ takes no account of spatial quality changes in stereo or multichannel recordings. Hence, this method is insufficient for multichannel audio quality prediction. As mentioned previously, multichannel audio (especially 5.1 surround audio) has become particularly prevalent in commercial systems and applications, and hence an objective method for the evaluation of multichannel audio quality has great importance. In [37], Karjalainen describes a number of features that might be used for the prediction of multichannel audio quality. However, there has been no work reported so far, in the context described here, that uses them for the prediction of multichannel audio quality. In [18], an algorithm to evaluate multichannel audio quality is proposed. However, its prediction capability is limited to some specific degradation types only (bandwidth limitation). The algorithm proposed in this paper is the first step towards a comprehensive algorithm to predict multichannel audio quality.

Letowski describes the two attributes of sound quality as timbral quality and spatial quality [19]. Also, ITU has proposed two subattributes of basic audio quality, "Front image quality" and "Impression of surround quality," for multichannel audio with more than two audio channels [20], which can be considered as two attributes of so-called sound quality as proposed by Letowski. Later, in [21], Rumsey et al. demonstrated the relative importance of spatial and timbral fidelities. They showed that two attributes, frontal spatial fidelity and surround spatial fidelity, were statistically significant and, therefore, important for the prediction of the basic audio quality.

The aim of this paper is not to propose an algorithm to replace the existing PEAQ model with an equivalent "multichannel PEAQ" algorithm, but to describe a number of features that can be extracted from multichannel audio recordings to represent the spatial audio quality attributes: frontal spatial fidelity and surround spatial fidelity. Frontal spatial fidelity can be defined as the global attribute that describes any and all detected differences in the "spatial impression" inside the frontal arc (see the nonshaded area in Fig. 1) of the multichannel audio setup, between the reference and the evaluated recording. The definition of the surround spatial fidelity can be given as the



Fig. 1. Multichannel audio setup: Frontal arc and the angle outside frontal arc.

global attribute that describes any and all detected differences in the spatial impression outside the frontal arc (see the shaded area in Fig. 1) of the multichannel audio setup, between the reference and the evaluated recording.

The design of a predictor involves two phases: calibration and validation. Calibration is the fundamental process used to achieve consistency of prediction using a set of variables and a desired output. The validation phase verifies the accuracy of the calibration with a new set of test cases. It is important to emphasize here that the developed algorithm was calibrated using the database obtained from listening tests conforming to the modified ITU-R BS.1534 (MUSHRA) Recommendation [20] with a mono signal used as an additional anchor. Consequently, the applicability of the developed predictor is limited to the scores obtained in the context of MUSHRA-conformant listening tests with a mono signal used as one of the supplementary anchors (this issue will be discussed in more detail in Sections II and VII).

This paper is arranged as follows. In Section II, a brief description of the database used for the calibration is presented. Section III is dedicated to illustrating the various features extracted from the recordings for the models, and Section IV describes how a subset of features has been derived. Section V describes the optimization process used for the calibration of the model and the results of the calibration. The details of a validation experiment are offered in Section VI. The results of calibration and validation are discussed in Section VII. The paper ends with conclusions and suggestions for future work in Section VIII.

II. SUMMARY OF CALIBRATION DATABASE

The database used for the calibration was obtained from a series of listening tests conducted in an ITU-R BS.1116-compliant listening room at the University of Surrey, U.K. The strategy used in the listening tests was a modified ITU-R BS.1534 Recommendation, as described in [22]. A summary of the experiment is given in the following paragraphs and a more detailed description can be found in [22].

Twelve program items were used for the listening tests, selected from movies, music recordings, TV programs, etc.

TABLE I
DEGRADATIONS WITH BANDWIDTH LIMITATION (FOR CALIBRATION EXPERIMENT)

TABLE II
DOWN-MIXED VERSIONS (FOR CALIBRATION EXPERIMENT)

Two types of program items were used in the calibration database, these being recordings with "F-B" and "F-F" audio scene characteristics. An audio scene with "F-F" characteristics means that the front and rear channels contained clearly distinguishable audio sources. The listening impression is similar to that when a listener is surrounded by a group of instruments in an orchestra. The listening impression from a recording with the "F-B" characteristic is similar to that experienced in a concert hall. That is, the front channels contain clearly perceived audio sources and the rear channels contain mainly reverberant sounds and room response. A detailed description of these characteristics can be found in [23] and [24].

The program items were processed so as to have two types of quality degradation (see Tables I and II), by means of bandwidth limitation and down-mixing. For down-mixing, the algorithms presented in [36] were used. From Table II, it can be seen that all the down-mixed versions (except the 1/2 down-mixed recordings) do not have surround channels after the down-mixing. One may question how the surround spatial fidelity of such recordings can be evaluated when no surround channels are present (i.e., when the contents of the recordings are down-mixed to 3, 2, and 1 channels in the front). However, some past experiments have shown that even for mono or two-channel stereo reproduction modes listeners still perceived some sort of surrounding spatial impression or envelopment, and hence they graded surround spatial fidelity for these recordings at a higher level than expected [22]. It was hypothesized that this phenomenon could be attributed either to interactions between the loudspeaker signals and the room reflections or to the signal interactions caused by the front loudspeakers at the ears of the listener. The detailed examination of this phenomenon, however, is beyond the scope of this study.

Bandwidth limitation was applied to the recordings by following two different approaches. In the first approach, all the channels present in the recordings were processed with filters


Fig. 2. Grading scale used for the listening tests.

having equal cutoff frequencies. In the second approach, the channels were processed with filters differing in cutoff frequencies. In total there were 138 recordings. Reference recordings selected from the various sources were used in the listening test with 48-kHz sampling frequency at 16-bit resolution.

The calibration database was obtained in listening tests designed according to a modified ITU-R BS.1534 (MUSHRA) Recommendation [20]. The main modification was that instead of asking the listeners to assess the basic audio quality, they were asked to grade two other attributes independently, namely frontal spatial fidelity and surround spatial fidelity. Each subject was asked to grade the frontal spatial fidelity and surround spatial fidelity of the test recording by comparing the corresponding item to the reference recording. A 100-point scale, as shown in Fig. 2, was used for the grading.

The MUSHRA test is a multistimulus test in which two mandatory signals have to be included in the pool of items to be graded: the hidden reference (original recording) and a 3.5-kHz low-pass filtered version of the original recording (the so-called anchor). According to the ITU-R BS.1534 Recommendation, "other types of anchors showing similar types of impairments as the system under test can be used" [20]. Hence, an additional mono anchor was also used in the listening tests. In line with the recommendation, the listeners were instructed to assign the top value of the scale (100) to the hidden reference. In other words, regardless of the absolute perceptual magnitude of the frontal spatial fidelity or surround spatial fidelity, the listeners were instructed to assess the unprocessed original recording as 100. Although the listeners were not given any specific instructions as to how they should assess the spatial fidelity of the anchors, a visual inspection of the database obtained in the listening tests revealed that for the mono anchor the participants gave scores spanning only the bottom range of the scale. Consequently, in order to mimic the way in which the subjects assessed the stimuli, normalization of the objective features was necessary (see Sections III-A2 and III-A3). As a result, the applicability of the algorithm described in this paper is limited to the prediction of scores obtained using listening tests similar in their design to the one described previously.

III. FEATURES EXTRACTED FOR THE PREDICTION

A number of features were extracted from the recordings to check their suitability for predicting the subjective grades. As mentioned in the previous section, the degradations applied to the recordings were created by bandwidth limitation and down-mixing. Therefore, the primary aim in the study reported here was to find some features that could represent the difference between the spectral and spatial characteristics of the reference and test recording. Hence, the features extracted could be divided mainly into two categories, spectral features and spatial features, on the basis of the characteristics that they carried. All the computations described in this section were carried out using Matlab 7.0 in a Mac OS X 10.3.9 environment.

A. Spectral Features

To extract the spectral features, the surround recordings were down-mixed to a mono recording by summing all the individual channels. This mono recording was then processed in different ways to build the features, as described in the following paragraphs.

Spectral centroid and spectral rolloff: The center of gravity of the magnitude spectrum of the short-time Fourier transform (STFT) of an audio signal is termed the spectral centroid [25]. The spectral centroid is a measure of spectral shape and is the objective representation of the subjective attribute "brightness" [42]. For the computation of the spectral centroid and spectral rolloff (see below), the down-mixed mono signal is divided into frames of size 42.67 ms and the Fourier transform is performed on each frame. The centroid and rolloff would vary from recording to recording depending on the type of music and recording. By computing the difference feature (as described in Section III-A2), the perceptual difference between the reference and test recording can be modeled.

The spectral centroid is defined as the average frequency weighted by the magnitudes divided by the sum of the magnitudes. The basic calculation of this feature is given as

$$C_t = \frac{\sum_{n} n\, M_t[n]}{\sum_{n} M_t[n]} \qquad (1)$$

where $M_t[n]$ is the magnitude of the Fourier transform of frame $t$ at frequency index $n$.

Similarly, in the basic calculation of the spectral rolloff, the point $R_t$ determines where 95% of the frame's magnitude is achieved [26]. Thus, $R_t$ is defined as the smallest frequency index such that the inequality

$$\sum_{n=1}^{R_t} M_t[n] \ge 0.95 \sum_{n=1}^{N} M_t[n] \qquad (2)$$

is satisfied, where $N$ is the total number of frequency indices in the frame. Formulas (1) and (2) were applied to each frame of the test and reference down-mixed recordings. A set of features was derived as described in Sections III-A1–III-A3.
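As an illustration of (1) and (2), the following Python sketch (not the authors' Matlab code) computes the per-frame spectral centroid and the 95% spectral rolloff of a mono down-mix; the frame length of 2048 samples (42.67 ms at 48 kHz), the non-overlapping framing, and all variable names are assumptions made for this example.

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=2048):
    """Split a mono signal into non-overlapping frames (42.67 ms at 48 kHz)."""
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def centroid_and_rolloff(x, frame_len=2048, rolloff_pct=0.95):
    """Per-frame spectral centroid, eq. (1), and 95% rolloff index, eq. (2)."""
    frames = frame_signal(x, frame_len)
    mags = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectra M_t[n]
    n = np.arange(mags.shape[1])                        # frequency bin index
    centroid = (mags * n).sum(axis=1) / (mags.sum(axis=1) + 1e-12)   # eq. (1)
    cumsum = np.cumsum(mags, axis=1)
    target = rolloff_pct * mags.sum(axis=1, keepdims=True)
    rolloff = np.argmax(cumsum >= target, axis=1)       # smallest bin satisfying (2)
    return centroid, rolloff
```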

1) Averaged Basic Feature Across the Frames: The average value of the basic features across all frames was calculated for


the reference and test recordings. The averaged feature is defined as

$$\bar{F} = \frac{1}{T}\sum_{t=1}^{T} F_t \qquad (3)$$

where $t$ is the frame number, $F_t$ is the basic feature (either spectral centroid or spectral rolloff) calculated for frame $t$, and $T$ represents the total number of frames in the audio excerpt.

2) Difference Feature: According to the ITU-R BS.1534 Recommendation, listeners are expected to evaluate the perceptual differences between the reference and the test recordings [20]. As a result, the scores obtained in the listening tests retain only information about the perceptual differences between the reference and the test items, not the absolute magnitude of the graded attribute. Consequently, in order to mimic the way the listeners evaluated spatial audio fidelity, there was a need to introduce some form of normalization of scores in the algorithm for the purpose of automatic prediction of the listening test scores. For example, the difference feature was introduced to represent the perceptual difference between the reference and test recording. This feature was defined as

$$\Delta F = \bar{F}_r - \bar{F}_t \qquad (4)$$

where $\bar{F}_r$ and $\bar{F}_t$ are the averaged spectral centroid or spectral rolloff calculated for the reference and test recording. Throughout this paper, the subscripts $r$ and $t$ correspond to reference and test recordings, respectively.

3) Rescaled Feature: Visual inspection of the scores obtained for spatial fidelities in the listening tests revealed that, for a given original recording and its processed versions, the scores spanned almost the whole grading scale regardless of the actual magnitude of perceptual differences between the reference and test recordings (the reference recordings were always graded using the top value of the scale, whereas the mono and 3.5-kHz anchors were typically graded using the bottom range of the scale). This phenomenon is well known in psychology and is referred to as range equalization bias [43]. In order to mimic this phenomenon, it was decided to rescale the features so that for a given original recording and its processed versions the features spanned the range of values from a fixed maximum value (for the reference) to a fixed minimum value (for the anchor). Hence, the spectral features were rescaled to a range between 100 and 20. The rescaled spectral feature was generated by rescaling the averaged basic feature in the following way:

$$F_{rescaled} = a\bar{F} + b \qquad (5)$$

where $a$ and $b$ were calculated as given in the following expressions:

$$a = \frac{100 - 20}{\bar{F}_{hr} - \bar{F}_{an}} \qquad (6)$$

$$b = 100 - a\bar{F}_{hr} \qquad (7)$$

where $\bar{F}_{hr}$ and $\bar{F}_{an}$ are the basic spectral features computed for the hidden reference and the 3.5-kHz anchor, respectively.
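A minimal sketch of the difference feature (4) and of the linear rescaling (5)-(7), as read above: `f_ref`, `f_test`, and `f_anchor` stand for the averaged basic feature (3) computed for the hidden reference, the test item, and the 3.5-kHz anchor; the function names and the exact algebraic form of the rescaling are assumptions of this illustration.

```python
def difference_feature(f_ref, f_test):
    """Eq. (4): difference between the reference and test averaged features."""
    return f_ref - f_test

def rescale_feature(f_test, f_ref, f_anchor, top=100.0, bottom=20.0):
    """Eqs. (5)-(7): assumed linear map sending the reference feature to `top`
    (100) and the anchor feature to `bottom` (20), as described in Sec. III-A3."""
    a = (top - bottom) / (f_ref - f_anchor)   # eq. (6)
    b = top - a * f_ref                       # eq. (7)
    return a * f_test + b                     # eq. (5)
```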

Fig. 3. Coherence spectrum obtained for a reference recording and a test recording degraded by the Hybrid C process.

4) Spectral Coherence Based Feature: A coherence spectrum is obtained by computing the correlation between the corresponding frequency indices of various frames present in the reference and test recordings. It is calculated using the following expression:

$$C_{rt}(f) = \frac{|P_{rt}(f)|^2}{P_{rr}(f)\,P_{tt}(f)} \qquad (8)$$

where $C_{rt}(f)$ is the coherence estimate between the reference and test recording, $P_{rr}(f)$ and $P_{tt}(f)$ are the power spectral densities of the reference and test recordings, and $P_{rt}(f)$ is the cross power spectral density of the reference and test recording. $P_{rr}(f)$, $P_{tt}(f)$, and $P_{rt}(f)$ were obtained from the mono versions of the reference and test recordings. For the computation of this feature, the built-in function "mscohere" from Matlab 7.0 was used [27]. An example of the coherence spectrum calculated between the mono signals obtained from a reference and a test recording limited in bandwidth by the Hybrid C strategy is given in Fig. 3. The mono signals of the test recordings were made using a time-alignment algorithm in order to compensate for the group delays caused by the low-pass filters. The value of coherence was near to unity at those frequencies where the signal had not been affected by the low-pass filtering. In the "Hybrid C" condition, different channels were filtered using filters with different cutoff frequencies (18.25 kHz for front left and right, 3.5 kHz for center, and 10 kHz for surround channels). As seen in the example for Hybrid C, the filtering did not affect any channels up to 3.5 kHz of the spectra and, hence, the coherence was near to unity up to that frequency.

From the coherence spectrum, the center of gravity of the spectrum (COH) was computed by applying (1) and by replacing the magnitudes $M_t[n]$ with the coherence values.
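A minimal Python sketch of this coherence-based feature, assuming time-aligned mono down-mixes `ref_mono` and `test_mono` (numpy arrays at 48 kHz); scipy.signal.coherence is used here in place of Matlab's mscohere, and the segment length is an arbitrary choice for this example.

```python
import numpy as np
from scipy.signal import coherence

def coherence_centroid(ref_mono, test_mono, fs=48000, nperseg=2048):
    """COH: centre of gravity of the magnitude-squared coherence spectrum (8),
    i.e. eq. (1) with the magnitudes replaced by coherence values."""
    f, cxy = coherence(ref_mono, test_mono, fs=fs, nperseg=nperseg)
    return float((f * cxy).sum() / (cxy.sum() + 1e-12))
```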

B. Spatial Features

The purpose of the spatial features was to represent the perceptual difference between the spatial characteristics of the reference and the test recording. The spatial features presented here can be categorized into two types: 1) interaural cross correlation (IACC)-based features and 2) energy-based features. For computing the IACC-based spatial features, the multichannel audio


recordings were converted into a binaural form using convolution of the loudspeaker signals with head-related transfer functions (HRTFs) corresponding to the angles at which the respective loudspeakers were positioned (see Fig. 1). For the purpose of this study, the database of HRTFs measured by Gardner and Martin [28] was adopted. The energy-based features were basically the ratios of energies of the loudspeaker signals computed in different ways, as described in Sections III-B4 and III-B5.

1) Broadband IACC: The interaural cross correlation coefficient (IACC) is an objective measure used in concert hall acoustics. It is calculated from the head-related impulse responses obtained from a dummy head placed inside a concert hall. It is computed by the following expression [38]:

$$\mathrm{IACC} = \max_{|\tau| \le 1\,\mathrm{ms}} \left| \frac{\int p_L(t)\, p_R(t+\tau)\, dt}{\sqrt{\int p_L^2(t)\, dt \int p_R^2(t)\, dt}} \right| \qquad (9)$$

where $p_L$ and $p_R$ represent the left and right channels of the binaural recording and the argument $\tau$ is in the range of $\pm 1$ ms. The basic computation of the IACC was done using (9). The IACC measured with music signals has different properties compared to that obtained from impulse responses. The measurement based on music signals depends on the average length of notes, the presence of vibrato, and also other factors such as the direct-to-reverberant energy ratio. Measuring the IACC with music has a basis in perception as it is related to the way humans perceive IACC during a concert [39], [40]. The IACC is a useful physical correlate of source spaciousness or the subjective phenomenon of apparent source width [29].

In this study, the broadband IACC was calculated for each 42.67-ms frame of the corresponding binaural versions of the reference and test recordings. The cross correlation between the left and right channels of the binaural signal was computed for the argument $\tau$ ranging from $-1$ to $+1$ ms and the maximum value was chosen. Since the cross correlation varied widely from frame to frame, it was decided to create the feature for the prediction by averaging the maxima of the IACC values over all the frames present in a recording. This averaging resulted in a better representation of the IACC for the graded excerpt.
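The sketch below illustrates this frame-wise broadband IACC of (9) on an HRTF-rendered binaural pair `left`/`right` (numpy arrays at 48 kHz); the frame length and the direct-form correlation loop are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def broadband_iacc(left, right, fs=48000, frame_len=2048, max_lag_ms=1.0):
    """Average over frames of the maximum interaural cross-correlation (9),
    searched over lags of +/- max_lag_ms."""
    max_lag = int(fs * max_lag_ms / 1000)            # 1 ms -> 48 samples at 48 kHz
    maxima = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        norm = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2)) + 1e-12
        # normalised cross-correlation for every lag tau in [-max_lag, +max_lag]
        iacf = [np.sum(l[max(0, -tau):frame_len - max(0, tau)] *
                       r[max(0, tau):frame_len - max(0, -tau)]) / norm
                for tau in range(-max_lag, max_lag + 1)]
        maxima.append(np.max(np.abs(iacf)))          # IACC of this frame
    return float(np.mean(maxima))                     # averaged across all frames
```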

2) Rescaled Broadband IACC: As mentioned previously, in order to mimic the way the listeners evaluated spatial audio fidelities, there was a need to introduce some form of normalization of scores in the algorithm for the purpose of automatic prediction of the listening test scores. Therefore, the IACC-based feature described in Section III-B1 was rescaled to the range between 1 and 0 by applying (10), which is a simplified form of (5). This equation was obtained after $\bar{F}_{hr}$ and $\bar{F}_{an}$ in (6) and (7) were substituted by the IACCs obtained for the hidden reference and the mono anchor. Equation (10) results after substituting $\mathrm{IACC}_{an} = 1$, since for a mono recording the IACC would be equal to 1:

$$\mathrm{IACC}_{rescaled} = \frac{1 - \mathrm{IACC}_t}{1 - \mathrm{IACC}_r} \qquad (10)$$

where $\mathrm{IACC}_r$ and $\mathrm{IACC}_t$ are the broadband IACCs calculated for the reference and test recordings, respectively.

This feature was calculated for head positions of 0° and 90°. In the first case (0°), the head was facing the center loudspeaker, as shown in Fig. 1. In the second case, the head was rotated 90° clockwise.

3) Rescaled Low-Frequency Band IACCs: The calculation of the rescaled low-frequency IACCs was done in a way similar to that mentioned in Section III-B2 above, except that it was computed for three octave filter bands with center frequencies of 500, 1000, and 2000 Hz. The rescaling of the IACC was done as described in Section III-A3. A feature was formed by choosing the maximum of the three IACCs from the three bands. The rescaled IACC features were then calculated for seven dummy head rotations of 0°, 30°, 60°, 90°, 120°, 150°, and 180°.
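A sketch of the band-limited variant, reusing the broadband_iacc() helper from the previous sketch. The octave-wide Butterworth band-pass filters, their order, and the rescaling form of (10) (reference mapped to 1, mono anchor to 0) are assumptions of this illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def octave_band(x, fc, fs=48000, order=4):
    """Octave-wide Butterworth band-pass filter centred at fc (assumed design)."""
    sos = butter(order, [fc / np.sqrt(2), fc * np.sqrt(2)], btype='bandpass',
                 fs=fs, output='sos')
    return sosfiltfilt(sos, x)

def low_freq_band_iacc(left, right, fs=48000):
    """Maximum of the broadband IACCs computed in the 500, 1000 and 2000 Hz bands."""
    return max(broadband_iacc(octave_band(left, fc, fs), octave_band(right, fc, fs), fs)
               for fc in (500.0, 1000.0, 2000.0))

def rescale_iacc(iacc_test, iacc_ref):
    """Eq. (10): maps the reference to 1 and a mono recording (IACC = 1) to 0."""
    return (1.0 - iacc_test) / (1.0 - iacc_ref + 1e-12)
```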

4) Back-to-Front Energy Ratio: Morimoto describes a relationship between the spatial impression and the loudspeaker energy in a multichannel audio setup [30]. The back-to-front energy ratio was selected as a feature since it was a descriptor of the energy distribution in a multichannel audio soundfield and it had a significant effect on the spatial impression provided to the listeners. The decision to use the back-to-front energy ratio instead of a front-to-back one was to avoid the possibility of division by zero if there was zero energy in the rear channels (as described earlier in Table II, some down-mixed recordings do not have any signals in the rear channels). The back-to-front energy ratio is defined as follows:

$$R_{BF} = \frac{E_B}{E_F} \qquad (11)$$

where $E_F$ represents the sum of the rms levels of the signals in the front speakers, whereas $E_B$ represents the sum of the rms levels of the signals reproduced by the rear channels.

In addition to the back-to-front ratio, another feature was included in the algorithm. This new feature was calculated using the same equation as before; however, $E_F$ and $E_B$ were replaced with the average rms levels in the front and rear speakers, respectively.
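A minimal sketch of the back-to-front energy ratio (11) and its averaged-rms variant, assuming the loudspeaker signals are held in a dict keyed by channel name; the channel grouping follows the five-channel layout of Fig. 1 and the key names are hypothetical.

```python
import numpy as np

# Hypothetical channel keys for the five-channel layout of Fig. 1.
FRONT = ('L', 'C', 'R')
REAR = ('LS', 'RS')

def rms(x):
    return float(np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2)))

def back_to_front_ratio(channels, average=False):
    """Eq. (11): rear-channel rms energy over front-channel rms energy.
    With average=True, the summed rms levels are replaced by averages,
    giving the second, averaged variant described above."""
    agg = np.mean if average else np.sum
    front = [rms(channels[k]) for k in FRONT if k in channels]
    rear = [rms(channels[k]) for k in REAR if k in channels]
    e_front = float(agg(front))
    e_back = float(agg(rear)) if rear else 0.0   # some down-mixes have no rear channels
    return e_back / (e_front + 1e-12)
```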

5) Lateral Energy (LE): In concert hall acoustics, the lateral gain is considered to be an objective measurement of listener envelopment [31]. Since listener envelopment is directly related to the sense of spatial impression, it was decided to check whether this feature could be used as a predictor of the frontal spatial fidelity or the surround spatial fidelity. The lateral gain can be computed as follows:

$$LG = \frac{\int_{80\,\mathrm{ms}}^{\infty} p_8^2(t)\, dt}{\int_{0}^{\infty} p_o^2(t)\, dt} \qquad (12)$$

where the numerator is the energy of a concert hall impulse response $p_8$ measured through a figure-of-eight microphone and the denominator is the energy of an impulse response $p_o$ measured through an omnidirectional microphone. The figure-of-eight microphone and the omnidirectional microphone are modeled using the following equation:

$$g(\theta) = a + (1-a)\cos\theta \qquad (13)$$

where $\theta$ represents the angle of sound incidence and $a$ is a coefficient depending on the type of microphone ($a = 1$ for an omnidirectional and $a = 0$ for a figure-of-eight microphone).

There were two main problems making a direct implementation of (12) impossible in this study. First, in contrast to concert hall acoustics, where almost all measurements are undertaken


using the impulse responses, in the current study only the final recordings were available (e.g., musical signals), without access to any impulse responses of the recording studios or recording equipment. Therefore, it was necessary to use continuous signals instead of impulse responses in order to calculate (13). Second, the numerator in (12) was originally intended to measure the energy of late lateral reflections (hence the 80 ms used as a lower integration limit [31]). In the current study, the measurement of late lateral reflections was not very important as many of the recordings employed in the listening tests contained predominantly direct audio sources panned around a listener. Hence, the original value of 80 ms used as the lower integration limit in the numerator was replaced by 0. Due to the significant modifications described above, it was decided to refer to this modified feature as "lateral energy" (LE), as this better reflected the current authors' intentions regarding how this parameter was used.

The lateral energy (LE) was computed from the loudspeaker signals, and the values of $\theta$ were selected as the corresponding speaker azimuths of the multichannel audio setup used in the listening tests (0°, 30°, 110°, 330°, and 250°). The upper limit of integration in (12) was impractical to implement and, hence, the integration time was set to the duration of the recordings.
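The sketch below shows one possible direct computation of the modified LE from the loudspeaker signals, with the microphones of (13) reduced to fixed per-channel gains and the figure-of-eight pattern assumed to be aimed along the lateral axis (hence the sin(theta) weighting). Both of these modelling choices, and the channel naming, are assumptions of this illustration, not the authors' exact procedure.

```python
import numpy as np

# Hypothetical channel names mapped to the loudspeaker azimuths quoted above.
AZIMUTHS_DEG = {'C': 0.0, 'L': 30.0, 'RS': 110.0, 'R': 330.0, 'LS': 250.0}

def lateral_energy(channels):
    """Modified LE: laterally weighted signal energy over total signal energy,
    integrated over the whole recording (lower limit 0 ms)."""
    num = den = 0.0
    for name, sig in channels.items():
        theta = np.deg2rad(AZIMUTHS_DEG[name])
        energy = float(np.sum(np.asarray(sig, dtype=float) ** 2))
        num += (np.sin(theta) ** 2) * energy   # figure-of-eight assumed lateral-facing
        den += energy                          # omnidirectional reference energy
    return num / (den + 1e-12)
```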

It was decided to calculate three separate features based on LE. The first feature was obtained using the direct calculation of LE, whereas the second feature was estimated as the difference between the LE for the reference and test recordings. The third feature was computed using the rescaling strategy presented in Section III-A3, with the values obtained for the mono anchor substituted as the anchor values.

The usefulness of this new lateral energy-based feature is questionable. Informal predictions using the LE-based features independently in a linear regression model showed that they were not very important for the prediction of frontal and surround spatial fidelities.

In addition to the features described in the previous paragraphs, selected interactions were also examined between LE, the back-to-front energy ratio, and all other direct features. Interactions were examined by calculating products of selected features and using them in the model. The purpose of the energy-based spatial features was to represent listener envelopment and, hence, the authors believed that these features could significantly interact with the other features even if they failed to be useful predictors on their own.

The final list of extracted features is presented in Table III. For clarity, the table does not include any interactions between features.

IV. SELECTION OF THE EXTRACTED FEATURES

As described in the previous section, a number of features were extracted from the recordings and used for the prediction of frontal spatial fidelity and surround spatial fidelity. The details of all direct features used in the regression model are presented in Table III. As described previously, two sets of interaction features were built by multiplying the selected direct features with the back-to-front ratio and the lateral energy and used in the regression model.

Some of the features extracted have similar characteristics because they attempt to represent the same properties. Consequently, the results showed that many of the features

TABLE III
LIST OF EXTRACTED FEATURES

(particularly the IACC-based features at different head positions) were highly correlated. If these features had been used directly in a multiple linear regression model, the results would, therefore, have been biased by multicollinearity, resulting in an abnormal variance inflation factor (VIF) [32]. There are many regression analysis techniques that are robust to multicollinearity, such as partial least squares regression, principal component regression, ridge regression, etc. These methods tend to yield the same or similar results. For this study, ridge regression [33] was chosen as an alternative method to multiple linear regression.

In ridge regression, a shrinkage value needs to be specified. Ridge regression shrinks the regression coefficients by imposing a penalty on their size [41]. Selection of a shrinkage value for ridge regression is a tradeoff between biased coefficients and the magnitude of the standard error of estimate (SE). If the shrinkage value is chosen to be very small, the model has a high correlation and a low standard error, but the coefficients are biased. This situation is very similar to that of a multiple linear regression and, thus, the model normally would not pass a validation test. Hence, the selection of the shrinkage value was done as a compromise between minimizing the standard error of estimate and the biasing, so as to obtain a stable model [33], [34]. The calibration of the models for frontal spatial fidelity and surround spatial fidelity was done with the selected shrinkage value. The macro "RIDGEREG" available with SPSS [35] was used to predict the spatial fidelities.
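By way of illustration only, the calibration stage can be sketched with scikit-learn's ridge regression in place of the SPSS RIDGEREG macro: X is a matrix of extracted features (one row per graded item), y the corresponding listening-test scores, and the alpha value shown is arbitrary, since it does not map one-to-one onto the shrinkage value used by the authors.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def calibrate(X, y, alpha=1.0):
    """Fit a ridge model on standardized features and return it together with
    its coefficients (comparable in spirit to the standardized Beta values)."""
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X, y)
    return model, model.named_steps['ridge'].coef_

# Example usage (hypothetical data shapes):
# model, betas = calibrate(X_train, y_frontal)
# predicted_scores = model.predict(X_new)
```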

A. Features for Frontal Spatial Fidelity Prediction

First, the prediction of frontal spatial fidelity was done. The frontal spatial fidelity was the dependent variable and the extracted features were the independent variables. The correlation and the standard error (SE) were calculated for the prediction. These were then related to the original scale used for the evaluation of frontal spatial fidelity (a 100-point scale, as shown in Fig. 2). All the standard errors mentioned throughout this paper are relative to this scale. The initial model showed a high correlation with a low standard error. However, after the analysis of the model it was found that not


TABLE IV
FRONTAL SPATIAL FIDELITY (SELECTED FEATURES AFTER THE FIRST ITERATION)

Fig. 4. Selected features from the first iteration of the frontal spatial fidelity prediction.

all of the features were important for the prediction. Hence, the decision was made to retain only those features that exhibited relatively high importance in the prediction.

Table IV shows the unstandardized coefficients (B), standardized coefficients (Beta), associated test statistics, and 95% confidence intervals of these features obtained from the initial model. The B values are also known as regression coefficients and are used in the regression equation to make the prediction. The Beta value of each feature describes how well it fits in a regression model and how important it is compared to the other features. The higher the magnitude of the Beta value, the greater the importance of a feature in a regression model. Fig. 4 shows the Beta values of the important features with their 95% confidence intervals.

The selection of the features was done by following an iterative process in order to obtain a simplified model. The more features present in a regression model, the greater the cost of computation and model maintenance. The process of finding a suitable model is a compromise between the number of features and error variance. The number of iterations may vary depending on the number of features used for the prediction and the criteria used for selecting the important features and, hence, several iterations may be required to produce an adequate model [33]. In the first step of the iteration, those features with relatively small standardized coefficients were removed. The important features obtained after the first iteration are given in Table IV. Surprisingly, one feature showed a high Beta value in the first iteration. In the second iteration, the important features found from the first iteration were applied to the regression model. It was found that

TABLE V
FRONTAL SPATIAL FIDELITY (MODEL AFTER FINAL ITERATION)

TABLE VI
SURROUND SPATIAL FIDELITY (SELECTED FEATURES AFTER THE FIRST ITERATION)

three of the features showed relatively low Beta values compared to those of the others. The low Beta value of one of them in the second iteration indicated that it was not a useful feature, as anticipated in the previous section. Therefore, these features were removed from the model and the remaining features were applied to a regression model (third iteration). Throughout this process, the retained features were monitored to determine whether they were statistically significant or not. The resulting model obtained after the third iteration is given in Table V. Only a small set of features was retained in the model, among them COH and the interaction between the back-to-front energy ratio and the broadband IACC measured at a 90° head azimuth; the full list is given in Table V. Almost all of the other interaction features were found to be unimportant. The selected feature set for the prediction of frontal spatial fidelity was statistically significant.

Table V shows the result from the final iteration. The high values of Beta, 95% confidence intervals of the same sign, and high magnitudes of the test statistics in Table V support the fact that the selected features in the model were important and statistically significant.

B. Features For Surround Spatial Fidelity

The simplified subset of features for the prediction of surround spatial fidelity was selected using a procedure similar to that described in the previous subsection. The first iteration resulted in the selected features listed in Table VI. The model showed a correlation of 0.96 with a standard error of 9.09. Fig. 5 shows the Beta values of the important features with associated 95% confidence intervals.

In the final iteration, it was found that the relevant features chosen from the previous iteration were statistically significant and important for the prediction of surround spatial


Fig. 5. Selected features from the first iteration of the surround spatial fidelity prediction.

Fig. 6. Scatter plot of the frontal spatial fidelity prediction.

fidelity. The high values of Beta, 95% confidence intervals with the same sign, and high magnitudes of the test statistics indicate that the selected features were important and statistically significant.

Section V presents the prediction of the two spatial fidelities using the subsets of features obtained from the iterative process described previously.

V. RESULTS OF MODEL CALIBRATION

A. Calibration Model for Frontal Spatial Fidelity

Table V, discussed above, shows a list of all the important features retained in the regression model predicting frontal spatial fidelity. An equation was built using the regression coefficients given in Table V and the frontal spatial fidelity was predicted. The predicted results showed a high correlation with the listening test scores, with a standard error of 9.33. The graph given in Fig. 6 shows a scatter plot of the prediction.

TABLE VII
SURROUND SPATIAL FIDELITY (MODEL AFTER FINAL ITERATION)

Fig. 7. Scatter plot of the surround spatial fidelity prediction.

B. Calibration Model for Surround Spatial Fidelity

The calibration procedure followed for the prediction of surround spatial fidelity was similar to that of the frontal spatial fidelity. The surround spatial fidelity was predicted using the regression equation built from the coefficients given in Table VII. The model showed a high correlation with a standard error of estimate of 8.87. The scatter plot of the surround spatial fidelity prediction is shown in Fig. 7.

VI. VALIDATION EXPERIMENT

The validity of the model described previously had to be checked in order to verify whether it could be generalized. Therefore, another listening test was conducted with a new set of audio recordings and listeners. In order to reduce the possibility of bias, the validation experiment was conducted by a person who was not involved in the experiments that resulted in the creation of the calibration database.

According to [24], multichannel audio recordings exhibiting the "F-F" characteristic are "critical" program materials for revealing perceptual changes in audio quality caused by down-mixing. Therefore, the validation experiment consisted of recordings with the "F-F" audio characteristic only. However, it is


often hard to make a black and white distinction between "F-B" and "F-F" program types, as many items contain elements of both. Additional validation is required to determine the broad applicability of the model to other types of program material.

A. Experimental Setup

Once again, the experiment for validation was conducted in an ITU-R BS.1116-compliant listening room at the University of Surrey. The strategy for the subjective evaluation of spatial fidelities was based on the ITU-R BS.1534 Recommendation. The listeners who participated in the test were final-year Tonmeister and research students from the Institute of Sound Recording at the University of Surrey. Fifteen listeners took part in the test.

The degradation types used for the experiment were the same: bandwidth limitation and down-mixing. Two degradations were removed (Hybrid G and Hybrid H) and two additional down-mixes (2/1 and 1/1) were included in the test. The algorithms used for down-mixing are presented in [36].

The test was conducted in three separate sessions. The first session was a training and practice session, which was exclusively meant to give the subjects an opportunity to become familiar with the test environment and learn how to interpret the scales. Also, this session enabled them to discriminate between the attributes that they would evaluate and the attributes that they were expected to ignore during the test. The listeners were instructed to ignore timbral changes during the evaluation of frontal spatial fidelity and surround spatial fidelity. It was anticipated that they might confuse timbral changes with spatial fidelity changes if they were allowed to do the test of frontal and surround spatial fidelity without becoming familiar with these changes. Hence, they were given a small test for evaluating timbral fidelity during the practice session. In summary, during the training and practice session, the listeners were given three small tests to evaluate timbral fidelity, frontal spatial fidelity, and surround spatial fidelity. The objective of the second and third sessions was to evaluate either the frontal spatial fidelity or the surround spatial fidelity. The software used for the listening test had instant switching capability between the reference and evaluated audio excerpts. The excerpts were looped and the fade-over during switching between audio excerpts was not noticeable. The methodology of this listening test was similar to that described in [22].

B. Validation of Frontal Spatial Fidelity Model

For the validation of the calibrated frontal spatial fidelity model, the important features selected after the final iteration were computed from the recordings used for the validation experiment. The equation obtained from the calibration experiment was used to predict the frontal spatial fidelity scores. The features calculated from each recording were substituted into the equation and the predicted frontal spatial fidelity was compared against the listening test scores. The analysis showed that the correlation between the predicted and the actual scores was equal to 0.88 with a standard error of 15.39. This result could be considered promising. Fig. 8 shows the scatter plot of the predicted results and the frontal spatial fidelity scores obtained from the validation listening test. Fine tuning of the model is required in order to obtain more accurate predicted scores.

Fig. 8. Scatter plot of frontal spatial fidelity with the validation scores.

Fig. 9. Scatter plot of surround spatial fidelity with the validation scores.

C. Validation of Surround Spatial Fidelity Model

The strategy followed for the validation of the surround spatial fidelity was exactly the same as that for the frontal spatial fidelity validation. The predicted results showed a correlation of 0.87 with a standard error of 14.04. This means that the model built from the calibration experiment is capable of predicting the actual listening test scores with relatively high accuracy. Fig. 9 shows the scatter plot of the predicted surround spatial fidelity and the scores obtained from the validation experiment for surround spatial fidelity. Here, the validation test of the surround spatial fidelity was similar to that of the frontal spatial fidelity.
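For completeness, a small sketch of the validation metrics quoted above: the Pearson correlation between predicted and observed scores and a standard error on the 100-point scale, computed here simply as the rms residual (the authors' exact SE formula is not reproduced in the text).

```python
import numpy as np
from scipy.stats import pearsonr

def validation_metrics(predicted, observed):
    """Return (Pearson r, rms residual) between predicted and listening-test scores."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    r, _ = pearsonr(predicted, observed)
    se = float(np.sqrt(np.mean((observed - predicted) ** 2)))
    return r, se
```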


Section VII seeks to analyze the results and present some additional observations. Overall, the obtained regression models performed well.

VII. DISCUSSION

Since the degradations used here were of two basic types (bandwidth limitation and down-mixing), the model is limited in its ability to predict other types of spatial fidelity changes, such as changes in the location of the sources when the overall width and envelopment remain unchanged. Hence, in order to make this model more universally applicable, features that represent other degradation types would need to be included.

The model described in this paper was calibrated and validated using listening tests based on the modified MUSHRA test [20] with a hidden reference and two anchors: a 3.5-kHz low-pass signal and a mono signal. The listeners were instructed to assess the hidden reference using the top value of the scale. Although the listeners were not instructed as to how they should assess the anchors, it was observed that for the mono signal the listeners graded both frontal spatial fidelity and surround spatial fidelity using the bottom range of the scale (it is likely that this phenomenon can be attributed to the so-called range equalization bias [43]). In order to imitate the way the subjects assessed the stimuli, the objective features derived in the model were normalized accordingly. This poses a significant limitation to the generalizability of the obtained results. The derived objective features and the developed model should be applied to data obtained from listening tests designed using a procedure similar to the one described previously. Caution should be used when trying to apply the model to data obtained from other types of listening tests. In such cases, the proposed features could still prove to be valid; however, the normalization applied in this experiment may not be adequate and a different form of post-processing of the features might be necessary.

By looking at Tables V and VII, it can be seen that the IACC-based features have an important impact on the prediction of frontal spatial fidelity and surround spatial fidelity. For the frontal spatial fidelity prediction, the two IACC-based features obtained at the head orientation of 0° had greater importance than the others. The IACC-based features outside the frontal arc were important features for the prediction of surround spatial fidelity. Notably, the IACCs measured at angles of 60°, 90°, 120°, and 180° were important. This suggests that measuring the IACCs at different head positions inside the region of attention would help in the prediction of frontal and surround spatial fidelity.

From the aforementioned tables, it can be seen that the spectral feature centroid of spectral coherence (COH) is also an important factor for the prediction of frontal and surround spatial fidelities. In addition, for the prediction of surround spatial fidelity, the rescaled spectral rolloff also contributed, and the importance of this feature was greater than that of COH. This suggests that spectral changes affect both frontal and surround spatial fidelities. From informal predictions carried out separately, it was seen that coherence and spectral rolloff were important features for the recordings with bandwidth limitation and less important for the down-mixed recordings. Also, for the frontal

spatial fidelity, coherence exhibits higher Beta values than for surround spatial fidelity. The frontal image of typical program material is more complex than the surround image, since it typically carries more audio sources and the attention of the listener is probably more biased towards the frontal image. Due to bandwidth limitation, the perceived location, distance, and other spatial attributes of the sound sources might have changed significantly more than those in the surround image.

The interaction feature between the broadband IACC at 90° and the back-to-front energy ratio also appeared to be important in the prediction of frontal spatial fidelity. Further work is required to confirm the validity of this interaction feature for the prediction of frontal spatial fidelity.

The standard error of prediction was near to ten for both calibration models. In a typical listening test, a 95% confidence interval of ten on a 100-point scale is common, and these models offer a similar level of accuracy. Considering the validation experiment in terms of correlation and standard error, it can be seen that both the frontal and surround spatial fidelity models performed well (correlations of 0.88 and 0.87, respectively, and SE of approximately 15). The prediction accuracy could be improved in the future by including more features that represent phantom image changes, such as changes in the location and distance of individual sources or groups thereof. This suggests the need for a basic form of spatial scene decomposition. In summary, to predict the frontal spatial fidelity and surround spatial fidelity with more precision, more effective features are required that represent the changes perceived in the recordings arising from the quality degradations introduced.

VIII. CONCLUSION AND FUTURE WORK

This paper presented a number of features and their interactions extracted from multichannel audio recordings to predict frontal spatial fidelity and surround spatial fidelity. The important features were chosen based on a ridge regression analysis and used for the prediction. The prediction results helped in revealing underlying relationships between the extracted spatial features and spectral features. The results also showed that measuring the IACC-based features at different head positions (inside or outside the frontal arc) can be used for the prediction of frontal and surround spatial fidelity. The study explored the relationship between the selected features and perceptual attributes and, hence, these results can be useful to researchers seeking to choose features for the prediction of perceived spatial audio quality.

The applicability of the developed predictor is limited to two types of audio processes: bandwidth limitation and down-mixing. Since the predictor was calibrated and validated using a multistimulus listening test with a hidden reference and two anchors (a 3.5-kHz low-pass filtered signal and a mono signal), caution should be used when trying to apply the predictor to data obtained from other types of listening tests. In such cases, the features proposed in this paper could still prove to be valid; however, a different form of normalization might be required. Although additional investigation is required to generalize the models presented here, the results may help towards the development of an extended PEAQ model for surround audio.


ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the two anonymous reviewers for their constructive comments, from which the revision of this paper has benefited significantly. In addition, the authors would like to thank P. Marins, R. Conetta, R. Kassier, and M. Dewhirst for their valuable comments and suggestions.

REFERENCES

[1] M. Davis, “History of spatial coding,” J. Audio Eng. Soc., vol. 51, no. 6, pp. 554–569, Jun. 2003.

[2] M. R. Schroeder, B. S. Atal, and J. L. Hall, “Optimizing digital speech coders by exploiting masking properties of the human ear,” J. Acoust. Soc. Amer., vol. 66, pp. 1647–1652, Dec. 1979.

[3] J. Karjalainen, “A new auditory model for the evaluation of sound quality of audio systems,” in Proc. ICASSP, Tampa, FL, Mar. 1985, pp. 608–611.

[4] K. Brandenburg, “Evaluation of quality for audio encoding at low bit rates,” in Proc. 82nd AES Conv., London, U.K., 1987, Preprint 2433.

[5] T. Thiede and E. Kabot, “A new perceptual quality measure for the bit rate reduced audio,” in Proc. 100th AES Conv., Copenhagen, Denmark, May 1996, Preprint 4280.

[6] T. Sporer, “Objective audio signal evaluation—Applied psychoacoustics for modeling the perceived quality of digital audio,” in Proc. 103rd AES Conv., New York, Aug. 1997, Preprint 4512.

[7] J. G. Beerends and J. A. Stemerdink, “A perceptual audio quality measure based on a psychoacoustic sound representation,” J. Audio Eng. Soc., vol. 40, pp. 963–978, Dec. 1992.

[8] B. Paillard, P. Mabilleau, S. Morisette, and J. Soumagne, “Perceval: Perceptual evaluation of the quality of audio signals,” J. Audio Eng. Soc., vol. 40, pp. 21–31, Jan. 1992.

[9] C. Colomes, M. Lever, J. B. Rault, and Y. F. Dehery, “A perceptual model applied to audio bit-rate reduction,” J. Audio Eng. Soc., vol. 43, pp. 233–240, Apr. 1995.

[10] Method for objective measurements of perceived audio quality, ITU, 1998, ITU-R BS.1387-1.

[11] T. Thiede et al., “PEAQ - The ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48, no. 1/2, pp. 3–29, Jan./Feb. 2000.

[12] P. Kozlowski and A. B. Dobrucki, “Proposed changes to the methods of objective, perceptual based evaluation of compressed speech and audio signals,” in Proc. 116th AES Conv., Berlin, Germany, May 8–11, 2004, Paper 6085.

[13] ——, “Adjustment of parameters proposed for the objective, perceptual based evaluation methods of compressed speech and audio signals,” in Proc. 117th AES Conv., San Francisco, CA, Oct. 28–31, 2004, Paper 6286.

[14] B. Feiten et al., “Audio adaptation according to usage environment and perceptual quality metrics,” IEEE Trans. Multimedia, vol. 7, no. 3, pp. 446–453, Jun. 2005.

[15] J. G. A. Barbedo and A. Lopes, “Strategies to increase the applicability of methods for objective assessment of audio quality,” in Proc. 116th AES Conv., Berlin, Germany, May 8–11, 2004, Paper 6080.

[16] S. Torres-Guijarro et al., “Coding strategies and quality measure for multichannel audio,” in Proc. 116th AES Conv., Berlin, Germany, May 8–11, 2004, Paper 6114.

[17] R. Vanam and C. D. Creusere, “Evaluating low bitrate scalable audio quality using advanced version of PEAQ and energy equalisation approach,” in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’05), Mar. 18–23, 2005, pp. 189–192.

[18] S. George, S. Zielinski, and F. Rumsey, “Prediction of basic audio quality for multichannel audio recordings: Initial developments,” in Digital Music Res. Network Workshop and Roadmap Launch, Dec. 21, 2005.

[19] T. Letowski, “Sound quality assessment: Concepts and criteria,” in Proc. 87th AES Conv., New York, Oct. 18–21, 1989, Preprint 2825.

[20] Method for subjective listening tests of intermediate audio quality, ITU, 2001, ITU-R BS.1534.

[21] F. Rumsey, S. Zielinski, R. Kassier, and S. Bech, “On the relative importance of spatial and timbral fidelities in judgments of degraded multichannel audio quality,” J. Acoust. Soc. Amer., vol. 118, pp. 968–976, Aug. 2005.

[22] S. Zielinski, F. Rumsey, R. Kassier, and S. Bech, “Comparison of basic audio quality and timbral and spatial fidelity changes caused by limitation of bandwidth and by down-mix algorithms in 5.1 surround audio systems,” J. Audio Eng. Soc., vol. 53, no. 3, pp. 174–192, Mar. 2005.

[23] ——, “Comparison of quality degradation effects caused by limitation of bandwidth and by down-mix algorithms in consumer multichannel audio delivery systems,” in Proc. 114th AES Conv., Amsterdam, The Netherlands, Mar. 22–25, 2003, Paper 5802.

[24] ——, “Effects of down-mix algorithms on quality of surround sound,” J. Audio Eng. Soc., vol. 51, no. 9, pp. 780–792, Sep. 2003.

[25] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, Jul. 2002.

[26] ——, “Multifeature audio segmentation for browsing and annotation,” in Proc. IEEE Workshop Applications of Signal Process. Audio Acoust., New Paltz, NY, Oct. 17–20, 1999.

[27] “mscohere” (Signal Processing Toolbox), Mathworks Help for Matlab ver. 7.0, 2006 [Online]. Available: http://www.mathworks.com/access/helpdesk/help/toolbox/signal/mscohere.html

[28] B. Gardner and K. Martin, HRTF Measurements of a KEMAR Dummy-Head Microphone, 2006 [Online]. Available: http://sound.media.mit.edu/KEMAR.html

[29] F. Rumsey, Spatial Audio. Burlington, MA: Focal Press, 2001.

[30] M. Morimoto, “The role of rear loudspeakers in spatial impression,” in Proc. 103rd AES Conv., New York, Sep. 26–29, 1997.

[31] J. S. Bradley and G. A. Soulodre, “Objective measures of listener envelopment,” J. Acoust. Soc. Amer., vol. 98, no. 5, pt. 1, pp. 2590–2597, Nov. 1995.

[32] A. Field, Discovering Statistics Using SPSS, 2nd ed. Thousand Oaks, CA: Sage, 2005.

[33] D. Montgomery et al., Introduction to Linear Regression Analysis, 3rd ed. New York: Wiley-Interscience, 2001.

[34] M. Hansen, Lecture 4: Selection v. Shrinkage, Dept. Statistics, Univ. California, Los Angeles, 2006 [Online]. Available: http://www.stat.ucla.edu/~cocteau/stat120b/lectures/lecture4.pdf

[35] D. Wright, Ridge Regression in SPSS, Dept. Psychology, Univ. Sussex, Brighton, U.K., 2006 [Online]. Available: http://www.sussex.ac.uk/Users/danw/ESM/ridge_regression_in_spss.htm

[36] Multichannel stereophonic sound system with or without accompanying picture, ITU, 1992–1994, ITU-R BS.775-1.

[37] M. Karjalainen, “A binaural auditory model for sound quality measurements and spatial hearing studies,” in Proc. ICASSP, Atlanta, GA, May 7–10, 1996, vol. 2, pp. 985–988.

[38] T. Hidaka, L. L. Beranek, and T. Okano, “Interaural cross-correlation, lateral fraction and low and high-frequency sound levels as measures of acoustical quality in concert halls,” J. Acoust. Soc. Amer., vol. 98, no. 2, pt. 1, pp. 988–1005, Aug. 1995.

[39] R. Mason, T. Brookes, and F. Rumsey, “Frequency dependency of the relationship between perceived auditory source width and the interaural cross-correlation coefficient for time-invariant stimuli,” J. Acoust. Soc. Amer., vol. 117, no. 3, pt. 1, pp. 1337–1350, Mar. 2005.

[40] D. Griesinger, “The psychoacoustics of apparent source width, spaciousness and envelopment in performance spaces,” Acustica, pp. 721–731, 1997.

[41] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, ser. Springer Series in Statistics. New York: Springer-Verlag, 2001.

[42] J. W. Beauchamp, “Synthesis by spectral amplitude and ‘brightness’ matching of analyzed musical instrument tones,” J. Audio Eng. Soc., vol. 30, no. 6, pp. 396–406, Jun. 1982.

[43] E. C. Poulton, Bias in Quantifying Judgments. Mahwah, NJ: Lawrence Erlbaum, 1989.

Sunish George (S’06) received the B.Tech. degree from Cochin University of Science and Technology, Kerala, India, in 1999, and the M.Tech. degree in digital electronics and advanced communication from Manipal Institute of Technology, Karnataka, India, in 2003. After his graduation, he worked in various Indian software companies developing digital signal processing-based applications. He is currently pursuing the Ph.D. degree at the University of Surrey, Guildford, Surrey, U.K. The theme of his Ph.D. project is related to the development of methods for objective evaluation of multichannel audio quality.

Mr. George is a student member of the Audio Engineering Society.

Slawomir Zielinski received the M.Sc.Eng. degree in telecommunications and the Ph.D. degree in 1997, both from the Technical University of Gdansk, Gdansk, Poland.

He is a Lecturer in Sound Recording at the University of Surrey, Guildford, Surrey, U.K., where he previously worked as a Research Fellow; he also served as a Lecturer at the Technical University of Gdansk. Currently, he is involved in several research projects investigating methodologies for subjective and objective evaluation of audio quality in the context of multichannel audio systems. In the past, he participated in a number of projects in the area of sound synthesis, audio processing, and multimedia systems.

Dr. Zielinski is a member of the Audio Engineering Society.

Francis Rumsey received the B.Mus. Tonmeister degree (with first class honors) in music with applied physics in 1983 and the Ph.D. degree from the University of Surrey (UniS), Guildford, Surrey, U.K., in 1991.

He subsequently worked with Sony Broadcast in training and product management and was appointed as a Lecturer at UniS in 1986. He is a Professor and Director of Research at the Institute of Sound Recording, UniS, and was a Visiting Professor at the School of Music, Piteå, Sweden, from 1998 to 2004. He is the author of over 100 books, book chapters, papers, and articles on audio. His book Spatial Audio was published in 2001 by Focal Press. He was a partner in EUREKA project 1653 (MEDUSA), studying the optimization of consumer multichannel surround sound. He is currently leading a project funded by the EPSRC concerned with predicting the perceived quality of spatial audio systems, in collaboration with Bang & Olufsen and BBC Research and Development.

Prof. Rumsey was the winner of the 1985 BKSTS Dennis Wratten Journal Award, the 1986 Royal Television Society Lecture Award, and the 1993 University Teaching and Learning Prize. In 1995, he was made a Fellow of the AES for his significant contributions to audio education. He has served on the AES Board of Governors, was Chairman of the British Section (1992–1993), and has been AES Vice President, Northern Europe (1995–1997). He is currently Chairman of the AES Technical Committee on Multichannel and Binaural Audio Technology and is also Chair of the AES Membership Committee.