
The University of Texas at Arlington


EE5359 MULTIMEDIA PROCESSING FINAL REPORT

Study and implementation of G.719 audio codec and performance analysis of

G.719 with AAC (advanced audio codec) and HE-AAC (high efficiency-advanced

audio codec).

Student: Yashas Prakash

Student ID: 1000803680

Instructor: Dr. K. R. Rao

E-mail: [email protected]

Date: 05-9-2012.


Project Proposal:

Title: Study and implementation of the G.719 audio codec and performance analysis of G.719 with the AAC (advanced audio coding) and HE-AAC (high efficiency advanced audio coding) audio codecs.

Abstract:

This project describes a low-complexity full-band (20 kHz) audio coding algorithm which has recently been standardized by ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) as Recommendation G.719 [1]. The algorithm is designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 kHz sampling rate, operating at 32 - 128 kbps. The codec features very high audio quality and low computational complexity and is suitable for use in applications such as videoconferencing, teleconferencing, and streaming audio over the Internet [1]. This technology, while leading to exceptionally low complexity and a small memory footprint, results in high full-band audio quality, making the codec a good choice for any kind of communication device, from large telepresence systems to small low-power devices for mobile communication [2]. A comparison of G.719 with the widely used AAC and HE-AAC [9] audio codecs is carried out in terms of performance, reliability, memory requirements and applications. A Windows Media Audio file is encoded to the 3GP, AAC and HE-AAC formats using the SUPER © [13] software, and the different coding schemes are tested for performance, encoding and decoding durations, memory requirements and compression ratios.


List of acronyms

AAC - Advanced audio coding

ATSC - Advanced television systems committee

AES - Audio Engineering Society

DMOS - Degradation mean opinion score

EBU - European broadcasting union

FLVQ - Fast lattice vector quantization

HE-AAC - High efficiency advanced audio coding

HRQ - Higher rate lattice vector quantization

IMDCT - Inverse modified discrete cosine transform

ISO - International organization for standardization

ITU - International telecommunication union

JAES - Journal of the Audio Engineering Society

LC - Low complexity

LRQ - Lower rate lattice vector quantization

LFE - Low frequency enhancement

LTP - Long term prediction

MDCT - Modified discrete cosine transform

MPEG - Moving picture experts group


QMF - Quadrature mirror filter

SBR - Spectral band replication

SMR - Symbolic music representation

SRS - Sample rate scalable

TDA - Time domain aliased

WMOPS - Weighted million operations per second

An Overview of G.719 Audio Codec:

In the hands-free videoconferencing and teleconferencing markets, there is strong and increasing demand for audio coding providing the full human auditory bandwidth of 20 Hz to 20 kHz [1]. This is because:

• Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects (i.e. animal sounds, musical instruments, vehicles or nature sounds, etc.) which occupy a wider audio band than speech. Presentations involve remote education of music, playback of audio and video from DVDs and VCRs, audio/video clips from PCs, and elaborate audio-visual presentations from, for example, PowerPoint [1].

• Users perceive the bandwidth of 20 Hz to 20 kHz as representing the ultimate goal for audio bandwidth. The resulting market pressures are causing a shift in this direction, now that sufficient IP (internet protocol) bit rate and audio coding technology are available to deliver this.

As with any audio codec for hands-free videoconferencing use, the requirements include [1]:

• Low latency (to support natural conversation)

• Low complexity (free cycles for video processing and other processing reduce cost)

• High quality on all signal types [1].


Block diagram of the G.719 encoder:

Figure 1: Block diagram of the G.719 encoder [1].

In Figure 1 the input signal, sampled at 48 kHz, is processed through a transient detector. Depending on the detection of a transient, indicated by a flag IsTransient, a high-frequency-resolution or a low-frequency-resolution transform is applied to the input signal frame. The


adaptive transform is based on a modified discrete cosine transform (MDCT) in the case of stationary frames [1]. For transient frames, the MDCT is modified to obtain a higher temporal

resolution without a need for additional delay and with very little overhead in complexity.

Transient frames have a temporal resolution equivalent to 5 ms frames [1].

The MDCT is computed as a DCT-IV of the time-domain aliased signal and is defined as follows:

    y(k) = sum_{n=0}^{N-1} x̃(n) · cos[ (π/N)(n + 1/2)(k + 1/2) ],   k = 0, …, N-1

where
y(k) = transform coefficient k of the input frame
x̃(n) = time-domain aliased signal of the input signal

Block diagram of the G.719 decoder:

Figure 2: Block diagram of the G.719 decoder [1].


A block diagram of the G.719 decoder is shown in Figure 2. The transient flag is first decoded

which indicates the frame configuration, i.e. stationary or transient. The spectral envelope is then

decoded and the same, bit-exact, norm adjustment and bit-allocation algorithms are used at the

decoder to re-compute the bit-allocation which is essential for decoding quantization indices of

the normalized transform coefficients [1]. After decoding the transform coefficients, the non-coded transform coefficients (allocated zero bits) in low frequencies are regenerated by using a spectral-fill codebook built from the decoded transform coefficients [2].

TRANSFORM COEFFICIENT QUANTIZATION:

Each band consists of one or more vectors of 8-dimensional transform coefficients and the

coefficients are normalized by the quantized norm. All 8-dimensional vectors belonging to one

band are assigned the same number of bits for quantization. A fast lattice vector quantization

(FLVQ) scheme is used to quantize the normalized coefficients in 8 dimensions. In FLVQ the

quantizer comprises two sub-quantizers: a D8-based higher rate lattice vector quantizer (HRQ)

and an RE8-based lower-rate lattice vector quantizer (LRQ). HRQ is a multi-rate quantizer

designed to quantize the transform coefficients at rates of 2 up to 9 bit/coefficient and its

codebook is based on the so-called Voronoi code for the D8 lattice [4]. D8 is a well-known

lattice and is defined as:

    D8 = { (x1, …, x8) ∈ Z^8 : x1 + x2 + … + x8 is even }

where Z^8 is the lattice which consists of all points with integer coordinates. It can be seen that D8

consists of the points having integer coordinates with an even sum. The codebook of HRQ is

constructed from a finite region of the D8 lattice and is not stored in memory. The code words are

generated by a simple algebraic method and a fast quantization algorithm is used.
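The "simple algebraic method" for lattice quantization can be illustrated with the classic nearest-point rule for D8: round every coordinate to the nearest integer, and if the resulting sum is odd, re-round the coordinate with the largest rounding error in the opposite direction. This is a minimal sketch of D8 quantization in general, not the bit-exact G.719 HRQ search; the function name and test vectors are illustrative.

```python
def nearest_d8(x):
    """Nearest point in the D8 lattice (integer points with an even coordinate sum).

    Classic Conway-Sloane approach: round every coordinate; if the sum of the
    rounded point is odd, move the coordinate whose rounding error was largest
    one step in the direction of that error.
    """
    r = [round(v) for v in x]                 # nearest point in Z^8
    if sum(r) % 2 == 0:
        return r
    # Parity is odd: flip the coordinate that costs least to change by 1.
    errs = [v - rv for v, rv in zip(x, r)]
    i = max(range(len(x)), key=lambda k: abs(errs[k]))
    r[i] += 1 if errs[i] > 0 else -1
    return r
```

Because the codebook is algebraic, no table of code words needs to be stored, which is what keeps the HRQ memory footprint small.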


Figure 3: Observed spectrum of different sounds, voiced speech, unvoiced speech and pop music on different audio bandwidths [3].

Figure 3 illustrates how, for some signals, a large portion of energy is beyond the wideband

frequency range. While the use of wideband speech codecs primarily addresses the requirement

of intelligibility, the perceived naturalness and experienced quality of speech can be further

enhanced by providing a larger acoustic bandwidth [3]. This is especially true in applications

such as teleconferencing where a high-fidelity representation of both speech and natural sounds

enables a much higher degree of naturalness and spontaneity. The logical step toward the sense

of being there is the coding and rendering of super wide band signals with an acoustic bandwidth

of 14 kHz. The response of ITU-T to this increased need for naturalness was standardization of

the G.722.1 Annex C extension in 2005 [2]. More recently, this has also led ITU-T to start work


on extensions of the G.718, G.729.1, G.722, and G.711.1 codecs to provide super-wideband

telephony as extension layers to these wideband core codecs [3].

An overview of MPEG – Advanced Audio Coding

The advanced audio coding (AAC) scheme was a joint development by Dolby, Fraunhofer, AT&T, Sony and Nokia [9]. It is a digital audio compression scheme for medium to high bit

rates which is not backward compatible with previous MPEG audio standards. The AAC

encoding follows a modular approach and the standard defines four profiles which can be chosen

based on factors like complexity of bit stream to be encoded, desired performance and output.

Low complexity (LC)

Main profile (MAIN)

Sample-rate scalable (SRS)

Long term prediction (LTP)

Excellent audio quality is provided by AAC and it is suitable for low bit rate high quality audio

applications. MPEG – AAC audio coder uses the AAC scheme [9].

HE-AAC [8], also known as aacPlus, is a low bit rate audio coder. It is an AAC LC audio coder enhanced with spectral band replication (SBR) technology.

AAC is a second generation coding scheme which is used for stereo and multichannel signals.

When compared to the perceptual coders, AAC provides more flexibility and uses more coding

tools [12].

The coding efficiency is enhanced by the following tools and they help attain higher quality at

lower bit rates [12].

• This scheme has higher frequency resolution, with the number of spectral lines increased from 576 up to 1024.

• Joint stereo coding has been improved. The bit rate can frequently be reduced owing to the flexibility of the mid/side coding and intensity coding.

• Huffman coding [12] is applied to the coder partitions.
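The Huffman step can be illustrated with a generic code construction. Note that real AAC uses fixed, pre-trained Huffman codebooks chosen per scale-factor band rather than building a tree per stream, so this sketch only shows the principle of variable-length entropy coding.

```python
import heapq

def huffman_code(symbol_counts):
    """Build a Huffman code table {symbol: bitstring} from symbol frequencies.

    Frequent symbols receive short codes, rare symbols long ones; the
    resulting code is prefix-free, so the bitstream can be decoded without
    separators.
    """
    # Heap entries: [total count, tie-break id, {symbol: partial code}].
    heap = [[count, i, {sym: ""}]
            for i, (sym, count) in enumerate(sorted(symbol_counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Prepend one bit to every code in each merged subtree.
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]
```

For a skewed distribution such as {a: 5, b: 2, c: 1, d: 1}, the dominant symbol gets a 1-bit code and the rare ones 3-bit codes, which is where the bit-rate saving comes from.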


An overview of spectral band replication technology in AACplus audio codec

Spectral band replication (SBR) is a new audio coding tool that significantly improves the coding

gain of perceptual coders and speech coders. Currently, there are three different audio coders

that have shown a vast improvement by the combination with SBR: MPEG-AAC, MPEG-Layer

II and MPEG-Layer III (mp3), all three being parts of the open ISO-MPEG standard. The

combination of AAC and SBR will be used in the standardized Digital Radio Mondiale system,

and SBR is also being standardized within MPEG-4 [15].

Block diagram of SBR encoder:

Figure 4: Block diagram of the SBR encoder [15]

The basic layout of the SBR encoder is shown in figure 4. The input signal is initially fed to a

down-sampler, which supplies the core encoder with a time domain signal having half the

sampling frequency of the input signal. The input signal is in parallel fed to a 64-channel

analysis QMF bank. The outputs from the filter bank are complex-valued sub-band signals. The


sub-band signals are fed to an envelope estimator and various detectors. The outputs from the

detectors and the envelope estimator are assembled into the SBR data stream. The data is

subsequently coded using entropy coding and, in the case of multichannel signals, also channel-redundancy coding. The coded SBR data and a bitrate control signal are then supplied to the core

encoder. The SBR encoder interacts closely with the core encoder. Information is exchanged

between the systems in order to, for example, determine the optimal cutoff frequency between

the core coder and the SBR band. The core coder finally multiplexes the SBR data stream into

the combined bitstream [15].

Block diagram of the SBR decoder:

Figure 5: Block diagram of the SBR decoder [15]

Figure 5 illustrates the layout of the SBR enhanced decoder. The received bitstream is divided

into two parts: the core coder bitstream and the SBR data stream. The core bitstream is decoded

by the core decoder, and the output audio signal, typically of lowpass character, is forwarded to

the SBR decoder together with the SBR data stream. The core audio signal, sampled at half the

frequency of the original signal, is first filtered in the analysis QMF bank. The filter bank splits

the time domain signal into 32 sub-band signals. The outputs from the filter bank, i.e. the sub-band signals, are complex-valued and thus oversampled by a factor of two compared to a regular

QMF bank [15].


SUBJECTIVE PERFORMANCE OF G.719

Subjective tests for the ITU-T G.719 Optimization/Characterization phase were performed from mid-February through early April 2008 by independent listening laboratories in American English. According to a test plan designed by ITU-T Q7/SG12 [23] experts, two experiments were conducted on the joint candidate codec:

Experiment 1: Speech (clean, reverberant, and noisy)

Experiment 2: Mixed content and music

Mixed content items are representative of advertisement, film trailers, news with jingles, music

with announcements and contain speech, music, and noise. Each experiment used the “triple

stimulus/hidden reference/double blind test method” described in ITU-R Recommendation

BS.1116-1 [23]. A standard MPEG audio codec, LAME MP3 version 3.97 as found on the

LAME website was used as the reference codec in the subjective tests. The ITU-T requirement

was that the G.719 candidate codec at 32, 48, and 64 kbit/s be proven “Not Worse Than” the reference codec at 40, 56, and 64 kbit/s, respectively, with a 95% statistical confidence level. In addition, the G.719 candidate codec at 64 kbit/s was also tested against the G.722.1C codec at 48 kbit/s for Experiment 2. The subjective test results for the G.719 codec are shown in Figures 6-8.

Statistical analysis of the results showed that the G.719 codec met all performance requirements

specified for the subjective Optimization/Characterization test. For experiment 1 the G.719

codec was better than the reference codec at all bit rates. For experiment 2 the G.719 codec is

better than the reference codec at the lowest bit rate for all the items and at the two other bitrates

for most of the items. An additional subjective listening test for the G.719 codec was conducted

later to evaluate the quality of the codec at rates higher than those described in the ITU-T test

plan. Because the quality expectation of the codec at these high rates is high, a pre-selection of

critical items, for which the quality at the lower bit rate range was most degraded, was conducted

prior to testing. The test results are shown in Figure 8. The tests showed that transparency was reached for critical material at 128 kbit/s.


Figure 6: subjective test results experiment 1 [7]

Figure 7: subjective test results experiment 2 [7]


Figure 8: additional subjective tests [7]

Algorithmic efficiency

The G.719 codec has a low complexity and a low algorithmic delay [1]. The delay depends on

the frame size of 20 milliseconds and the look-ahead of one frame used to form the transform

blocks. Hence, the algorithmic delay of the G.719 codec is 40 milliseconds. The algorithmic delays of comparable codecs such as 3GPP eAAC+ [14] and 3GPP AMR-WB+ are significantly higher. For AMR-WB+ the algorithmic delay for mono coding is between 77.5 and 227.6 ms depending on the internal sampling frequency. For eAAC+ the algorithmic delay is 323 ms for

mono coding with 32 kbps and 48 kHz sampling rate. In Table 1 the average and worst-case

complexity of G.719 is expressed in Weighted Millions Operations Per Second (WMOPS). The

figures are based on complexity reports using the basic operators of ITU-T STL2005 Software

Tool Library v2.2 [7]. For comparison, the complexity of the three comparable audio codecs

eAAC+, AMR-WB+ and ITU-T G.722.1C [8], the low-complexity super-wideband codec (14

kHz) that G.719 was developed from, is shown in Table 2. The memory requirements of G.719

are presented in Table 3. The delay and complexity measures show that the G.719 codec is very

efficient in terms of complexity and algorithmic delay especially when compared to eAAC+ and

AMR-WB+.


Frame buffering and windowing with overlap

A time-limited block of the input audio signal can be seen as windowed with a rectangular

window. The windowing that is a multiplication in the time-domain becomes in the frequency

domain a convolution and results in a large frequency spread for this window. In addition the

sampling theorem states that the maximal frequency that can be correctly represented in discrete

time is the Nyquist frequency, i.e. half of the sampling rate, otherwise aliasing occurs. For

example in a signal sampled at 48 kHz a frequency of 25 kHz, i.e. 1 kHz above the Nyquist

frequency of 24 kHz, will be analyzed as 23 kHz due to the aliasing. Due to the large frequency

spread of the rectangular window the frequency analysis can be contaminated by the aliasing. In

order to reduce the frequency spread and suppress the aliasing effect windows without sharp

discontinuities can be used. Two examples are the sine and the Hann windows, defined in [17],

that compared to the rectangular window indeed have a larger attenuation of the side lobes but

also a wider main lobe. This is illustrated in Figure 9, where the shape of the windows and the corresponding frequency spectra can be observed. Consequently, there has to be a trade-off between the possible aliasing and the frequency resolution.

Figure 9: Three window functions and their corresponding frequency spectrum. The windows are 1920 samples long at a sampling rate of 48 kHz [17]
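The aliasing example above (a 25 kHz tone analyzed as 23 kHz when sampled at 48 kHz) can be verified numerically; a minimal sketch:

```python
import math

FS = 48_000          # sampling rate (Hz); Nyquist frequency is 24 kHz

def sample_tone(freq_hz, n_samples, fs=FS):
    """Sample a cosine of the given frequency at rate fs."""
    return [math.cos(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

# A 25 kHz tone (1 kHz above Nyquist) produces exactly the same samples as a
# 23 kHz tone: cos(2*pi*25000*n/48000) = cos(2*pi*n - 2*pi*25000*n/48000)
#            = cos(2*pi*(48000 - 25000)*n/48000).
above_nyquist = sample_tone(25_000, 64)
aliased       = sample_tone(23_000, 64)
max_diff = max(abs(a - b) for a, b in zip(above_nyquist, aliased))
```

Since the two sampled sequences are identical, no frequency analysis can tell them apart after sampling, which is exactly the aliasing the paragraph describes.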


In the synthesis of the analysed and encoded blocks of a processed audio signal, the window effects have to be cancelled. For example, the inverse window function could be applied to the coded time-domain blocks, but there is a high possibility that artefacts will be audible near the block edges due to discontinuities and amplification of the coding errors. In order to reduce the block artefacts, overlap-add techniques are commonly used [17].

In ITU-T G.719 the blocks of two consequent frames are windowed with a sine window of length 2N = 1920 samples that is defined by:

    w(n) = sin[ (π/2N)(n + 1/2) ],   n = 0, …, 2N - 1

The signals are processed with an overlap in the data of 50% between consecutive blocks. The windowed signal of each block is given by:

    x_w(n) = w(n) · x(n),   n = 0, …, 2N - 1
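As a quick check, the sine window satisfies w(n)² + w(n+N)² = 1 on its overlapping halves (the condition, often called the Princen-Bradley condition, that makes 50% overlap-add reconstruction possible); a minimal sketch:

```python
import math

N = 960                      # G.719 frame length; window length is 2N = 1920

def sine_window(two_n):
    """Sine window w(n) = sin(pi/two_n * (n + 0.5)), n = 0 .. two_n - 1."""
    return [math.sin(math.pi / two_n * (n + 0.5)) for n in range(two_n)]

w = sine_window(2 * N)

# Because w(n + N) = cos(pi/(2N) * (n + 0.5)), the overlapping halves satisfy
# w(n)^2 + w(n + N)^2 = sin^2 + cos^2 = 1 for every n.
worst = max(abs(w[n] ** 2 + w[n + N] ** 2 - 1.0) for n in range(N))
```

This identity is why the 50% overlap-add in the synthesis stage sums to unit gain across block boundaries.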

Figure 10: G.719 buffering, windowing and transformation of an audio signal [17]


Figure 10 shows the buffering and windowing with the overlap of N = 960 samples between the

blocks of length 2N. The blocks are time-domain aliased (TDA) into spectra of length N that are transformed using the type-IV Discrete Cosine Transform (DCT-IV). The information from the transient detector is not used in the buffering, the windowing or the TDA, but only in the DCT-IV, which implies that there is a common buffering and windowing for the stationary and transient modes. The combination of the TDA and the DCT-IV is the MDCT, which is further presented in the following section.

Modified Discrete Cosine Transform

The MDCT is used in G.719 to transform the buffered and windowed signal blocks into a frequency representation. The transform comprises Time-Domain Aliasing (TDA), which means

that the signal blocks of 2N=1920 samples are folded (aliased) into blocks of N=960 samples.

These time-domain aliased signals of each block are then represented by N coefficients of cosine

basis functions. Due to the TDA it is not possible to reconstruct the time-domain signals from

individual MDCT spectra, but the framework of overlapped signal blocks enables perfect

reconstruction. The 50% overlap and the properties of the windows are essential for the reconstruction, where the TDA can be cancelled by the overlap of consecutive inverse-transformed MDCT spectra. The sine window and the 50% overlap satisfy the conditions for Time-Domain Aliasing Cancellation (TDAC) and perfect reconstruction with the overlap-add technique. The signal blocks are overlapped

in order to avoid block artefacts. The number of frequency coefficients per time unit is thereby

increased in comparison to transformation of non-overlapped blocks. This implies that the bitrate

of coding the spectra is increased in order to avoid block artefacts. However, due to the TDA in

the MDCT the bitrate can be reduced by the corresponding factor of the overlap. This in

combination with the real frequency coefficients makes the MDCT competitive for audio coding

with a compact representation of the signals. The MDCT spectrum X_MDCT[k] of the windowed signal x_w[n] is by definition obtained as:

    X_MDCT[k] = Σ_{n=0}^{2N-1} x_w[n] · cos[ (π/N)(n + 1/2 + N/2)(k + 1/2) ],   k = 0, …, N - 1
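The MDCT/IMDCT pair and the TDAC property can be sketched directly from the definition. Normalization conventions vary between texts; the 2/N factor used in the inverse below is one common choice that yields unit overall gain when the window is applied at both analysis and synthesis.

```python
import math

def mdct(x):
    """2N samples -> N coefficients: X[k] = sum_n x[n] cos(pi/N (n+1/2+N/2)(k+1/2))."""
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X):
    """N coefficients -> 2N time-aliased samples; 2/N scaling (see lead-in)."""
    N = len(X)
    return [2.0 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]

def sine_window(two_n):
    return [math.sin(math.pi / two_n * (n + 0.5)) for n in range(two_n)]

# Two 50%-overlapped blocks of a test signal; the sine window is applied both
# before the MDCT and after the IMDCT, and overlap-add cancels the time-domain
# aliasing, reconstructing the middle N samples exactly.
N = 8
x = [math.sin(0.7 * n) + 0.05 * n for n in range(3 * N)]
w = sine_window(2 * N)
y0 = imdct(mdct([w[n] * x[n] for n in range(2 * N)]))        # covers x[0 : 2N]
y1 = imdct(mdct([w[n] * x[N + n] for n in range(2 * N)]))    # covers x[N : 3N]
recon = [w[N + n] * y0[N + n] + w[n] * y1[n] for n in range(N)]
err = max(abs(recon[n] - x[N + n]) for n in range(N))
```

Each individual IMDCT output is time-aliased and does not match the input block, yet the overlap-add of consecutive blocks is exact; this is the TDAC property the text describes, and it is what lets the MDCT keep N coefficients per N new input samples despite the 50% overlap.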


Transient mode transformation

In the transient mode of G.719 the time-domain aliased signal block is reversed in time and divided into four sub-frames. The reversal re-creates the temporal coherence of the input signal

that was destroyed by the TDA. The first and the last sub-frames are windowed by half-sine windows with a fourth of zero-padding, while the second and third sub-frames are windowed with the ordinary sine window, as illustrated in Figure 11. The overlap between the windowed sub-frames is 50% and each segment is MDCT transformed, i.e. time-domain aliased and DCT-IV transformed, which results in sub-spectra of length N/4. Thus the total length of the four sub-spectra is N frequency coefficients, i.e. the transform lengths are equal in the stationary and the

transient mode of G.719.

Figure 11: Windowing of sub-frames in the transient mode [1].

Perceptual coding

In G.719 the MDCT spectra are perceptually encoded based on a psycho-acoustical model. The

model describes the human hearing system and is used in order to introduce coding errors that

are not audible. In Figure 12 the principle of the perceptual coder is illustrated. The MDCT


spectrum of the transformed windowed time-domain signal is split into 44 sub-vectors that

approximate the frequency resolution of the ear by increasing sub-vector lengths with increasing

frequency. The sub-vector spectra are quantized and coded based on the subvector energies, or

norms, that are weighted according to the psychoacoustical model. The coding procedure is

similar for the two time-resolution modes in G.719, but for the transient mode the spectral

coefficients of the four sub-frames are interleaved before coding to preserve the coherence of the

signal in the time-domain.

Figure 12: Block diagram of the perceptual encoder based on MDCT domain masking [17]

The norm of each sub-vector is estimated and quantized with a uniform logarithmic scalar quantizer with 40 levels spaced 3 dB apart. The MDCT spectra are normalized with the quantized

norms in order to reduce the amount of information needed to describe the spectra. The

quantized norms are both differentially and Huffman encoded [3] before they are transmitted to

the decoder where they can be used to de-normalize the decoded MDCT spectra. In the next step

of encoding, bits are iteratively allocated to each sub-vector as a function of the quantized sub-

vector norms. The goal of the bit allocation is to distribute the available bits in a way that the

maximum subjective quality is obtained at a given data rate, i.e. a given number of bits.

Therefore the quantized norms of the sub-vectors are perceptually weighted to account for

psycho-acoustical masking and threshold effects. For each iteration in the allocation of bits, the

sub-vector of the largest weighted norm is found and one bit is assigned to each MDCT


coefficient in the corresponding sub-vector. The corresponding norm is decreased by 6 dB and

the procedure repeats until all available bits are assigned. When a sub-vector has been assigned 9 bits per coefficient, its norm is set to minus infinity so that no more bits are allocated to that sub-vector. Given the allocated bits, the normalized sub-vectors are lattice vector quantized and Huffman coded. More information about the vector quantization in G.719 specifically can be found in [1]. In the stationary mode the amount of non-coded spectral coefficients in the sub-vectors assigned zero bits is estimated, quantized and included in the bit stream for

frequencies below the so-called transition frequency. The quantization indices of the norms, the

encoded sub-vector spectra and the estimated noise level form the encoded bit stream. In

addition, side information such as the coding mode (stationary or transient) and the entropy-coding flag (Huffman or not) is added to the bit stream that is transmitted to the G.719 decoder.
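The norm quantization and the greedy bit-allocation loop described above can be sketched as follows. The index range/offset of the norm quantizer and the equal-length assumption for the sub-vectors are simplifications for illustration, not the bit-exact G.719 procedure.

```python
import math

LEVELS = 40     # norm codebook size; steps of 3 dB (both taken from the text)
STEP_DB = 3.0

def quantize_norm_db(norm):
    """Uniform scalar quantization of a positive sub-vector norm on a log
    scale in 3 dB steps (illustrative index mapping)."""
    idx = int(round(20.0 * math.log10(norm) / STEP_DB))
    return min(max(idx, 0), LEVELS - 1)

def allocate_bits(weighted_norms_db, vector_len, budget_bits):
    """Greedy allocation: give one bit per coefficient to the sub-vector with
    the largest weighted norm, lower that norm by 6 dB, and freeze a
    sub-vector once it reaches 9 bits/coefficient."""
    norms = list(weighted_norms_db)
    bits = [0] * len(norms)
    while budget_bits >= vector_len:
        i = max(range(len(norms)), key=lambda k: norms[k])
        if norms[i] == float("-inf"):
            break                          # every sub-vector is saturated
        bits[i] += 1
        budget_bits -= vector_len          # one bit for each coefficient
        norms[i] = norms[i] - 6.0 if bits[i] < 9 else float("-inf")
    return bits
```

Because the same quantized norms are available at the decoder, running this loop there reproduces the encoder's allocation bit-exactly without transmitting it, which is the point made in the decoder description.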

Steps for implementation:

• Use a C-compiler such as DevC++ to compile the code. Any C-compiler can be used to

generate the executable files.

• The encoder code is compiled to get the encoder.exe file, which is used for encoding the input test_vectors at 32, 48 and 64 kbps.

• The decoder code is compiled to get the decoder.exe file, which is used to decode the encoded test_vectors at 32, 48 and 64 kbps respectively.

• The decoded file is compared with the original test_vector in the console to check that they are the same.

Console commands to be used

• The console command to encode a test_vector at 32 kbps is as follows:

- g719encoder.exe -r 32000 -i *input file path\test_vector.raw -o *output file path\test_32000_en.bs

The console command for the decoder at the same bit rate is as follows:

- g719decoder.exe -r 32000 -i *path of the input file\test_32000_en.bs -o *specific path of the output file\test_32000_dec.raw


• Note: It is advisable to keep the encoded and decoded files in the same root folder, as this makes it easy to compare the files and verify that the sound frames are encoded and decoded correctly.

Type the console command:

- comp test_32000_dec.raw test_vector.raw

This command validates that the decoded file is in fact the same as the test_vector.raw file from which the encoding was made.

• The screen shots of the above commands implemented in the console are shown.

Implementation of the encoder:



Implementation of the decoder:


Comparison of the decoded signal with the original raw file at the same bit rate.

Performance analysis of G.719

Table 1: Performance analysis of G.719


Study of MPEG-2 advanced audio codec (AAC)

The AAC audio coding is an international standard first to be developed in MPEG-2 AAC [I]

(ISOIEC 13818-7) and is the base of MPEG-4 general audio coding. MPEG-2 AAC audio

coding has become very popular and been widely used. It is applicable for a wide range of

applications from Internet audio over digital audio broadcasting to multichannel surround sound.

It achieves high compression ratio and high quality performance due to an improved time

frequency mapping compatibles with other new tools, like TNS, predictor, etc. There are three

different profiles defined in AAC. It allows trade-offs in audio quality and encoding/decoding

complexity for different applications. Among that, the LC profile can provide nearly the highest

audio quality as the main profile, but with significant savings in memory and processing

requirements [17].


Block diagram of the AAC encoder

Figure 13: Block diagram of the AAC encoder [18]


Filterbank and block switching: The MDCT (modified discrete cosine transform) is the standard transform used to convert the incoming audio signal from the time domain to the frequency domain. The MDCT is a lapped transform based on the type-IV DCT. Since it is a lapped transform, the number of outputs is half the number of inputs. This transform is very useful in signal compression applications and is used in the AAC and AC-3 audio codecs. The MDCT is computed using the equation below [11]:

X_k = Σ_{n=0}^{2N-1} x_n cos[ (π/N) (n + 1/2 + N/2) (k + 1/2) ],   k = 0, 1, …, N-1

where X_k is the MDCT coefficient in the frequency domain and x_n is the sample in the time domain.

The inverse MDCT is computed by adding the consecutive overlapping blocks, thus cancelling the time-domain aliasing and retrieving the original signal. The formula used to compute the IMDCT is given below [11]:

y_n = (1/N) Σ_{k=0}^{N-1} X_k cos[ (π/N) (n + 1/2 + N/2) (k + 1/2) ],   n = 0, 1, …, 2N-1

where X_k is the MDCT coefficient in the frequency domain and y_n is the sample in the time domain.

The audio samples are first broken into segments called blocks. The data in these blocks are modified to provide a smooth transition between blocks by applying a time-domain filter called a window [10], after which the MDCT is applied to the windowed blocks. One of the challenges faced by audio coders is the selection of the optimal block size.


Figure 14: Block Switching and the window function [19]

Intermediate transition windows between the long and short windows smooth the window switching, as shown in Figure 14. AAC handles the difficulty associated with coding audio material that vacillates between steady-state and transient signals by dynamically switching between two block lengths: 2048 samples and 256 samples, referred to as long blocks and short blocks, respectively [10]. The long block offers improved coding efficiency for stationary signals, and the short block provides optimized coding capabilities for transient signals. AAC also switches between two different long-block window shapes, the sine window and the Kaiser-Bessel derived (KBD) window, according to the characteristics of the signal. The far-off rejection (stopband attenuation) is higher for the KBD window than for the sine-shaped window.

This signal adaptive selection of the transform length is an important feature and is controlled by analyzing the short time variance of the incoming audio signal. The block synchronicity between two channels with different block length sequences is ensured by performing eight short transforms in a row with 50% overlap and the transition windows are used at the start and end of a short sequence. Thus the spacing between two consecutive blocks is maintained at a constant level of 2048 input samples.

Filterbank and gain control: A gain control module and a processing block containing a uniformly spaced four-band polyphase quadrature filter (PQF) precede the MDCT. The gain control block is used to attenuate or amplify the output of each PQF band and decreases pre-echo effects. After gain control, an MDCT whose length is one quarter of the original MDCT length is applied to each PQF band.

Temporal noise shaping (TNS): Speech signals that vary with time are often a challenge to conventional transform schemes, owing to the fact that quantization noise is controlled over frequency but is constant over time within a transform block. The TNS technique was introduced into MPEG-2 AAC to overcome this limitation. It acts like a post-processing step of the MDCT which is used to create a continuously signal-adaptive filter bank instead of a switched filter bank. This scheme provides enhanced control of the location of quantization noise within a filter bank window in the time domain. It uses the principle of the duality of the time and frequency domains: a prediction approach is applied in the frequency domain to shape the quantization noise over time. This is done by filtering the original spectrum before quantization, and the quantized filter coefficients are transmitted in the bitstream. At the decoder they are used to undo the filtering, resulting in a temporally shaped distribution of quantization noise in the decoded audio signal.

TNS handles signals that are between steady-state and transient in nature. When a transient lies at one end of a long block, quantization noise would otherwise be spread throughout the audio block. TNS allows the non-transient locations in the block to be described with a greater amount of information. This results in an increase of quantization noise around the transient, where masking renders the noise inaudible, and a decrease of quantization noise in the steady-state region of the audio block [10].

Long term prediction (LTP): Redundancy reduction of stationary signal segments can be improved by frequency-domain prediction. Prediction is supported in long transform blocks and not in short blocks, since long blocks carry the stationary signals. The predictor can be implemented by a second-order backward-adaptive lattice structure which is calculated independently for every frequency line. The use of predicted values is controlled on a scalefactor-band basis and also depends on the prediction gain in the band. A cyclic reset mechanism, synchronized between the encoder and decoder, is used to improve stability. A drawback of the backward-adaptive structure of the filter is that the bitstream becomes sensitive to transmission errors.

LTP is a very effective tool for frequency-domain prediction, especially for signals with a clear pitch property. It reduces the redundancy of the signal between successive coding frames. The LTP implementation is simpler, and it uses a forward-adaptive predictor, making it less sensitive to numerical round-off errors in the decoder or to bit errors in the transmitted spectral coefficients.

Intensity stereo: Intensity stereo coding is based on an analysis of high-frequency audio perception, specifically on the energy-time envelope of this region of the audio spectrum. It allows a stereo channel pair to share a single set of spectral values for the high-frequency components while preserving the sound quality. The unique envelope of each channel is maintained by means of a scaling operation so that each channel produces the original level after decoding [10]. In this method, the right and left signals are replaced by a single signal plus directional information, thus reducing the bit rate. It is a lossy coding method used primarily at low bit rates.

Prediction: The prediction module is used to represent stationary or semi-stationary parts of an audio signal; the repeated information in sequential windows can be represented by a repeat instruction, thus reducing the redundancy of the signal. Since short blocks are used for non-stationary or rapidly varying signals, prediction is used only with long blocks. The prediction process is based on a second-order backward-adaptive model in which the spectral component values of the two preceding blocks are used by each predictor. The prediction parameters are adapted on a block-by-block basis [10].


Mid/side (M/S) stereo coding: M/S stereo coding is another data-reduction module based on channel-pair coding and is used to increase coding efficiency. Channel pair elements are analyzed as left/right and sum/difference signals on a block-by-block basis. In cases where the M/S channel pair can be represented by fewer bits, its spectral coefficients are coded and a bit is set to note that the block utilizes M/S stereo coding. M/S stereo achieves a significant saving in bit rate when the signal is concentrated in the middle of the stereo image. During decoding, the decoded channel pair is de-matrixed back to its original left/right state [10]. This scheme is used for coding at higher bit rates.

Scalefactors: The inherent noise shaping of the non-linear quantizer is not sufficient to achieve acceptable audio quality, so the noise is additionally shaped using scalefactors. The scalefactors increase the SNR (signal-to-noise ratio) in certain bands by amplifying the signal in those spectral regions; the bit allocation over frequency is thereby modified, as more bits are used to code the amplified spectral values. The scalefactors are transmitted within the bitstream so that the original spectral values can be reconstructed at the decoder. Huffman coding is used to reduce the redundancy within the scalefactor data.

Quantization and coding: The majority of the data reduction generally occurs in the quantization phase, after the data has already achieved a certain level of compression in the previous modules. In the AAC module, the spectral data is quantized under the control of the psychoacoustic model, and the number of bits used must stay below a limit determined by the desired bit rate. Huffman coding is also applied, in the form of twelve codebooks. In order to increase the coding gain, scalefactors of bands whose spectral coefficients are all zero are not transmitted [10]. Adaptive quantization is the primary source of bit-rate reduction, and the key components in the process are the quantization function and noise shaping. A non-linear quantizer is used, as it provides implicit noise shaping compared to a conventional linear quantizer.

Noiseless coding: This block is used to optimize the redundancy reduction and is nested inside the quantization and coding module. Noiseless dynamic-range compression can be applied prior to Huffman coding: a value of +1/−1 is placed in the quantized coefficient array to carry the sign, while the magnitude and an offset from the base, marking the frequency location, are transmitted as side information. This process is only used when it yields a reduction in the number of bits [10]. An efficient grouping algorithm is used to find an optimum trade-off between using the optimum table for each scalefactor band and minimizing the number of data elements to be transmitted.

The AAC decoder is shown in Figure 15. The coding efficiency is enhanced by the following tools, which help attain higher quality at lower bit rates [3]:

- The scheme has a higher frequency resolution, with the number of spectral lines increased from 576 up to 1024.
- Joint stereo coding has been improved; the bit rate can frequently be reduced owing to the flexibility of the mid/side coding and intensity coding.
- Huffman coding is applied to the coder partitions.

The following tools are used to improve the audio quality:

- Enhanced block switching: a switched MDCT filterbank with an impulse response of 5.3 ms at a 48 kHz sampling frequency is used. This helps in the reduction of pre-echo artifacts [3].
- TNS: an open-loop prediction is done in the frequency domain, which leads to noise reduction. This technique enhances the quality of speech at low bit rates.

Block diagram of the AAC decoder:

Figure 15: Block diagram of the AAC decoder [18]


The block diagram of the AAC decoder is shown in Figure 15.

Study of HE-AAC codec

The name MPEG-4 High-Efficiency AAC (HE-AAC) refers to a family of recent audio coders developed by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) by subsequent extension of the established Advanced Audio Coding (AAC) architecture. These algorithmic extensions facilitate a significant increase in coding efficiency relative to previous standards and other known systems. Thus, they provide a representation for generic audio/music signals that offers high audio quality even for applications limited in transmission bandwidth or storage capacity, such as digital audio broadcasting and wireless music access for cellular phones. This section presents a compact overview of the evolution, technology, and performance of the MPEG-4 HE-AAC coding family [20].

From the very beginning in 1992, MPEG audio coding formats and technology such as MPEG-1/2 Layer 3, popularly referred to as "mp3," have successfully supported and inspired new applications for audio-only and audio-visual storage/transmission. Among these coders, the MPEG-2 AAC scheme has emerged as the prominent "all-round coder" and evolved into the root of most subsequent MPEG audio coder developments. The AAC architecture was carried forward as the core of MPEG-4 audio coding for generic audio signals and further developed to support a range of functionalities, such as scalability, low-delay operation (Low-Delay AAC, AAC-LD), and lossless signal representation. Two of the latest additions aim at providing improved coding efficiency at very low data rates and are referred to by the name HE-AAC [20].

Functionalities of HE-AAC

HE-AAC supports a broad range of compression ratios and configurations, ranging from highly efficient mono and stereo coding (typical operation point: 32 kb/s stereo with HE-AAC v2) via high-quality multichannel coding (typical operation point: 160 kb/s for a 5.1 configuration) to near-transparent multichannel compression (typical operation point: 320 kb/s using AAC without extensions). Because subsequent HE-AAC versions form a superset of their predecessors, HE-AAC v2 decoding is fully compatible with AAC-only and HE-AAC v1 content.

ARCHITECTURE

The basic architecture of the HE-AAC codec is shown in Figure 16 [20]. The core of the system is the AAC waveform codec. For increased compression efficiency, the spectral band replication (SBR) bandwidth enhancement tool and the parametric stereo (PS) advanced stereo compression tool are added to the system. Both SBR and PS act as preprocessing blocks at the encoder side and postprocessing blocks at the decoder side. The bit streams created by the two tools are seamlessly transmitted in specific, previously unused sections of the AAC bit stream. This allows for reusability of existing AAC implementations and full decoding compatibility with existing AAC content. An HE-AAC v2 decoder comprises all three technologies and is a superset of an AAC or AAC+SBR (= HE-AAC v1) decoder. The bit stream syntax of HE-AAC allows for up to 48 audio channels. In practice, mono, stereo, and 5.1 multichannel are the most commonly used configurations. The PS technology is defined for stereo configurations only.

TOOLS

Here we describe the three core ingredients of HE-AAC: the AAC coding kernel, the SBR bandwidth extension, and the PS tool.


AAC belongs to the class of perceptually oriented traditional waveform codecs, meaning that it aims at reproducing the waveform of the original input audio signal with a minimum amount of data while taking into account psychoacoustic principles to minimize the audibility of coding effects. Other well-known representatives of this class of codecs are MPEG-1/2 Layer 2, MPEG-1/2 Layer 3 (better known as mp3), and Dolby AC-3 (better known as Dolby Digital). Today, AAC constitutes the most efficient waveform compression standard. Its most important tools are:

Modified discrete cosine transform (MDCT) filter bank using window switching [20]: Transforming the signal into a spectral representation is the key to applying psychoacoustic principles and redundancy reduction algorithms to audio content. For this purpose, AAC employs a 1,024-spectral-line MDCT filter bank, creating spectra corresponding to 1,024 PCM input samples. In the case of highly time-varying signals, the filter bank time resolution can be increased by producing a series of lower-resolution spectra, each corresponding to 128 input audio samples.

Stereo processing: Intensity and mid/side stereo processing are available to increase the compression efficiency for stereo signals. While the former is rarely used in practice, the latter is improved over what was available in earlier codecs like mp3.

Temporal noise shaping: The TNS tool allows the codec to shape quantization noise in the time domain by running a prediction across frequency on the spectral data. This avoids undesirable effects caused by the relatively coarse time resolution of the MDCT filter bank. The TNS tool was newly developed for AAC.

Quantization and coding: The tools to quantize and code the spectrum are similar to what is used in mp3, with significant refinements in the entropy coding stage, resulting in improved compression efficiency. The AAC bit stream syntax has been defined in a much more flexible way than in earlier codecs to support various configurations and future extensions [20].


Spectral band replication

Figure 16: Block diagram of the HE-AAC encoder and decoder, which use spectral band replication together with the AAC encoder and decoder, respectively [20]

Bandwidth extension technology is based on the observation that usually the upper part of the spectrum of an audio signal contributes only marginally to the "perceptual information" contained in the signal, and that human auditory perception is less sensitive in the high-frequency range. As an example, an audio signal that has been band-limited to 8 kHz is still fully recognized by humans (although it may not sound attractive without the upper part of the spectrum). SBR exploits this observation for the purpose of improved compression: instead of transmitting the upper part of the spectrum with AAC, SBR regenerates it from the lower part with the help of some low-bit-rate guidance data. For regenerating the missing high-frequency components, SBR operates in the frequency domain using a quadrature mirror filter (QMF) bank analysis/synthesis system. The most important building blocks of SBR are:

High-frequency reconstruction: The so-called transposer generates a first estimate of the upper part of the spectrum by copying and shifting the lower part of the transmitted spectrum. In order to generate a high-frequency spectrum that is close to the original spectrum in its fine structure, several provisions are available, including the addition of noise, the flattening of the spectral fine structure, and the addition of missing sinusoids.

Envelope adjustment: The upper spectrum generated by the transposer needs to be shaped subsequently with respect to frequency and time in order to match the original spectral envelope as closely as possible. The SBR bit stream data controls both the operation of the high-frequency reconstruction and the envelope adjustment. Depending on the specific configuration, the SBR side information rate is typically a few (e.g., 2–3) kb/s [20].

Implementation of AAC and HE-AAC codecs

A mp3 file of 32kbps bit rate is encoded to AAC and HE-AAC file formats for 32, 48 and

64kHz each and tabulated for comparison by using super software as shown in the following

screen shots


Screen shot 1: Implementation of AAC [13]


Encoding to HE-AAC file format

Screen shot 2: Implementation of HE-AAC [16]


Performance analysis using MUSHRA test

This test is done to assess the quality of an audio compression algorithm. Multiple stimuli with hidden reference and anchor (MUSHRA), defined by the International Telecommunication Union (ITU), is a methodology for the subjective evaluation of audio quality.

It is used to evaluate the perceived quality of the output of lossy audio compression algorithms. The MUSHRA methodology is recommended for assessing "intermediate audio quality." This method requires fewer participants to obtain statistically significant results, owing to the fact that all codecs are presented at the same time, on the same samples, so that a paired t-test can be used for statistical analysis.

In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference, and one or more anchors. The recommendation specifies that one anchor must be a 3.5 kHz low-pass version of the reference. The purpose of the anchor(s) is to make the scale closer to an "absolute scale," ensuring that minor artifacts are not rated as having very bad quality.

Performance analysis of AAC and HE-AAC

Table 2: Performance analysis of AAC and HE-AAC


Comparisons of codecs

Table 3: Comparison of G.719, AAC and HE-AAC

Conclusion: The G.719 codec was successfully implemented using the C programming language, and its performance was compared with that of the AAC and HE-AAC audio codecs. It was observed that G.719 performed best at lower bit rates, and best in terms of complexity and the hardware required for implementation, which is the reason it is used in the telecommunication field. AAC and HE-AAC, in contrast, use more complex algorithms in terms of memory usage and hardware complexity, and have therefore found their application in the music industry.


References:

[1] M. Xie, P. Chu, A. Taleb and M. Briand, " A new low-complexity full band (20kHz)

audio coding standard for high-quality conversational applications", IEEE Workshop on

Applications of Signal Processing to Audio and Acoustics, pp.265-268, Oct. 2009.

[2] A. Taleb and S. Karapetkov, " The first ITU-T standard for high-quality

conversational fullband audio coding ", IEEE communications magazine, vol.47, pp.124-

130, Oct. 2009.

[3] J. Wang, B. Chen, H. He, S. Zhao and J. Kuang, " An adaptive window switching

method for ITU-T G.719 transient coding in TDA domain", IEEE International

Conference on Wireless, Mobile and Multimedia Networks, pp.298-301, Jan. 2011.

[4] J. Wang, N. Ning, X. Ji and J. Kuang, " Norm adjustment with segmental weighted

SMR for ITU-T G.719 audio codec ", IEEE International Conference on Multimedia and

Signal Processing, vol.2, pp.282-285, May 2011.

[5] K. Brandenburg and M. Bosi, “ Overview of MPEG audio: current and future

standards for low-bit-rate audio coding ” JAES, vol.45, pp.4-21, Jan./Feb. 1997.

[6] A/52 B ATSC Digital Audio Compression Standard:

http://www.atsc.org/cms/standards/a_52b.pdf

[7] F. Henn , R. Böhm and S. Meltzer, “ Spectral band replication technology and its

application in broadcasting ”, International broadcasting convention, 2003.

[8] M. Dietz and S. Meltzer, "CT-AACPlus – a state of the art audio coding scheme", Coding Technologies, EBU Technical Review, July 2002.

[9] ISO/IEC IS 13818-7, “ Information technology – Generic coding of moving pictures

and associated audio information Part 7: advanced audio coding (AAC) ”, Jan. 2006.


 [10] M. Bosi and R. E. Goldberg, “ Introduction to digital audio coding standards ”,

Norwell, MA, Kluwer, 2003.

 [11] H. S. Malvar, “ Signal processing with lapped transforms ”, Artech House,

Norwood, MA, 1992.

[12] D. Meares, K. Watanabe and E. Scheirer, “ Report on the MPEG-2 AAC stereo verification tests ”, ISO/IEC JTC1/SC29/WG11, Feb. 1998.

[13] Super (c) v.2012.build.50: A simplified universal player encoder and renderer, A graphic user interface to FFmpeg, Mencoder, Mplayer, x264, Musepack, Shorten audio, True audio, Wavpack, Libavcodec library and Theora/vorbis real producers plugin: www.erightsoft.com

[14] T. Ogunfunmi and M. Narasimha, “ Principles of speech coding ”, Boca Raton, FL: CRC Press, 2010.

[15] P. Ekstrand, " Bandwidth extension of audio signals by spectral band replication ", IEEE, Workshop on model based processing and coding of audio, pp.53-58, Nov. 2002.

[16] T. Johnson, " Stereo coding for ITU-T G.719 codec ", Master of science, Thesis, Uppsala university, Sweden, May 2011.

[17] T. Tsai, C. Liu and Y. Wang, "A pure-ASIC design approach for MPEG-2 AAC audio decoder", International Conference on Information, Communications and Signal Processing, vol. 3, pp. 1633-1636, Dec. 2003.

[18] ISO/IEC 13818-7, "Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC)".

[19] P. Ekstrand, " Bandwidth extension of audio signals by spectral band replication ", IEEE Benelux Workshop on Model based Processing and Coding of Audio, Nov. 2002.

[20] J. Herre and M. Dietz, " MPEG-4 High-Efficiency AAC Coding ", IEEE signal processing magazine, May 2008.

[21] Proceedings of the IEEE, special issue on "Frontiers of audio visual communications: convergence of broadband computing and rich media", vol. 100, no. 4, Apr. 2012.


[22] Internet references:
- http://www.itu.int/rec/T-REC-G.719-200806-I/en
- http://www.audiocoding.com/
- http://www.polycom.com/index.html?ss=false
- http://en.wikipedia.org/wiki/MUSHRA
- http://sourceforge.net/project/showfiles.php?group_id=290&package_id=309
