IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 8, NOVEMBER 2008 1361
Union of MDCT Bases for Audio CodingEmmanuel Ravelli, Student Member, IEEE, Gal Richard, Senior Member, IEEE, and Laurent Daudet, Member, IEEE
AbstractThis paper investigates the use of sparse overcompletedecompositions for audio coding. Audio signals are decomposedover a redundant union of modified discrete cosine transform(MDCT) bases having eight different scales. This approach pro-duces a sparser decomposition than the traditional MDCT-basedorthogonal transform and allows better coding efficiency at lowbitrates. Contrary to state-of-the-art low bitrate coders, whichare based on pure parametric or hybrid representations, ourapproach is able to provide transparency. Moreover, we use abitplane encoding approach, which provides a fine-grain scalablecoder that can seamlessly operate from very low bitrates up totransparency. Objective evaluation, as well as listening tests,show that the performance of our coder is significantly betterthan a state-of-the-art transform coder at very low bitrates andhas similar performance at high bitrates. We provide a link totest soundfiles and source code to allow better evaluation andreproducibility of the results.
Index TermsAudio coding, matching pursuit, scalable coding,signal representations, sparse representations.
L OSSY audio coding removes statistical redundancy andperceptual irrelevancy in an audio signal to reduce theoverall bitrate. Statistical redundancy is reduced by using asparse signal representation such that the energy of the signalis concentrated in a few coefficients or parameters. Perceptualirrelevancy is exploited in audio coding by coarsely quantizing,or even removing, components that are imperceptible to thehuman auditory system. In this paper, we focus on the first part:we propose a new signal representation method for lossy audiocoding.
When transparency or near-transparency is required,state-of-the-art audio coders are mostly transform-basedand generally use the modified discrete cosine transform(MDCT). One example of such a coder is MPEG-2/4 Ad-vanced Audio Coding (AAC) ,  which is able to encodegeneral audio at 64 kb/s per channel with near-transparentquality , . However, MDCT-based coders are known tointroduce severe artifacts at lower bitrates (see, e.g., ) evenif numerous approaches have been proposed to reduce them(see, e.g.,  for an efficient statistical quantization scheme).
Manuscript received December 21, 2007; revised July 02, 2008. Currentversion published October 17, 2008. This work was supported in part by theFrench GIP ANR under contract ANR-06-JCJC-0027-01 DESAM and in partby EPSRC under Grant D000246/1. The associate editor coordinating thereview of this manuscript and approving it for publication was Dr. MalcolmSlaney.
E. Ravelli and L. Daudet are with the Institut Jean le Rond dAlembert-LAM,Universit Pierre et Marie Curie-Paris 6, 75015 Paris, France (e-mail: firstname.lastname@example.org; email@example.com).
G. Richard is with the TSI Department, GET-ENST (Tlcom Paris), 75014Paris, France (e-mail: firstname.lastname@example.org).
Digital Object Identifier 10.1109/TASL.2008.2004290
In this range, MDCT-based coders are nowadays outperformedby methods that use alternate signal representation based onparametric modeling. One of the most illustrative examplesof this is MPEG-4 SinuSoidal Coding (SSC) ,  which isbased on a sine+transients+noise model of the signal. Formalverification tests  show that SSC outperforms AAC at 24kb/s per channel. Finally, so-called hybrid coders combinetransform and parametric modeling, such as MPEG-4 HighEfficiency AAC (HE-AAC) ,  and 3GPP AMR-WB+. HE-AAC uses an MDCT to model the low half of thespectrum and a parametric approach to model the high-fre-quency components. AMR-WB+ combines several transformand linear prediction techniques. These hybrid approaches alsoperform better than AAC at low bitrates , . However,pure parametric or hybrid signal representation methods modelonly a subspace of the audio signals and thus are not able toprovide transparent quality, even at high bitrates.
In this paper, we propose a new signal representation basedon a union of a number of MDCT bases (typically eight) withdifferent scales. As opposed to existing methods, this signalrepresentation method is able to obtain transparency at highbitrates while giving better results than a transform-based ap-proach at low bitrates, provided that the signal is sufficientlysparse. This best-of-both-worlds approach between parametricand transform coding results however in a very significant in-crease in the computational cost for encoding, which hinderson-the-fly encoding for communications but is acceptable fornon-real-time coding applications. Our technique of encodingcoefficients provides a fully embedded bitstream resulting in ascalable coder ranging from very low bitrates to transparency.This scalability property, although not the main motivation ofour work, could be useful for applications such as transmissionof audio over variable bandwidth networks or transcoding ofaudio files.
Our approach is related to and different from existingmethods in several ways. First, it is related to transform codingand could be seen as a generalization of the transform approachsince it is based on a simultaneous use of a union of MDCTbases. This allows us to use efficient scalable encoding tech-niques used in transform coding , while producing asparser decomposition than the transform approach, and allowsbetter coding efficiency at low bitrates. Second, our approachis also related to parametric coding. We model the signal asa sum of multiscale sinusoidal atoms which is closely relatedto multiresolution sinusoidal modeling , a techniqueused, e.g., in SSC. In parametric audio coding, the sinusoidalmodel tries to match the sinusoidal content of the signal asclosely as possible, which is done using complex sinusoidalatoms with a precise estimate of amplitude, frequency, and
1558-7916/$25.00 2008 IEEE
1362 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 8, NOVEMBER 2008
Fig. 1. Block diagram of the proposed coder.
phase and a postprocessing stage that builds sinusoidal tracks.In our approach, we extract real sinusoidal atoms with onlyan amplitude parameter, and so we do not have to transmitany phase parameter. Contrary to the parametric approach, ourfrequencies are sampled from a limited range (the fast Fouriertransform (FFT) size equals the analysis window length), whichpermits transmission of frequency without requantizing. More-over, since the sinusoidal decompositions used in parametricaudio coding extract, a limited number of sinusoids from thesignal and model the residual as noise (plus perhaps transientmodeling). Clearly, the sinusoidal decompositions only modela subspace of the signal which limits their performance at highbitrates. Our approach is more general as it models the signalentirely with sinusoidal atoms, which is feasible here at a rea-sonable coding cost because the set of timefrequency atomshas a limited size, and consequently the cost to encode the indexof the selected atoms is not prohibitive. As a consequence, ithas the possibility of providing transparency at high bitrates.
The rest of the paper describes the different building blocksof our coder, illustrated in Fig. 1. First, the signal is decomposedover a union of MDCT bases with different scales. This is de-scribed in Section II. Then, the resulting coefficients are groupedand interleaved as described in Section III. Vectors of coeffi-cients are obtained and successively coded using a bitplane en-coding approach described in Section IV. It is important to notethat the proposed configuration clearly separates the signal anal-ysis stage and the coding stage, this configuration is also referredto as out-of-loop quantization or a posteriori quantization in theliterature. Finally, in Section V, we present objective and sub-jective evaluations, and conclude (Section VI) with perspectivesof future work.
II. SIGNAL DECOMPOSITION
A. Signal Model
The audio signal is decomposed over a union of MDCT baseswith analysis window lengths defined as increasing powers oftwo. In practice, for a signal sampled at 44.1 kHz, we foundthat using eight MDCT bases with window length from 128to 16 384 samples (i.e., from 2.9 to 370 ms) is a good tradeoffbetween accuracy and size of the decomposition set. Smallwindows are needed to model very sharp attacks while largewindows are useful for modeling long stationary components.Empirical tests have shown that using more MDCT baseswith window lengths larger than 16 384 does not significantlyimprove performance, but increases both the complexity andthe coding delay. While the MDCT ,  is a timefre-quency orthogonal transform widely used in state-of-the-arttransform-based audio coders, no work has been found in theaudio coding literature related to the simultaneous use of aunion of MDCT bases. However, the union of MDCT baseshas already been applied with success in other contexts such asaudio denoising .
The signal is decomposed as a weighted sum offunctions plus a residual of negligible energy . Onemay note here that corresponds to the length of the signal asthe analysis is performed on the whole signal; this is differentfrom the common approach in audio coding where the analysisis performed frame-by-frame, being the frame length. Themodel is given by
where are the weighting coefficients. The set of functionsis called the dictionary and is a union of
MDCT bases (called blocks)
where is the block index, is the frame index, is the fre-quency index, and is the h