mpeg1 audio

Multimedia Signals and Systems

MP3 - MPEG 1, 2 Layer 1, 2, 3 Polyphase Filterbank

Kunio Takaya

Electrical and Computer Engineering

University of Saskatchewan

March 31, 2008


“A review of algorithms for perceptual coding of digital audio signals”, T. Painter and A. Spanias, Dept. of Electr. Eng., Arizona State Univ., Tempe, AZ:
http://ieeexplore.ieee.org/iel3/4961/13644/00628010.pdf?arnumber=628010

MP3'Tech - Encoding engines source codes:
http://www.mp3-tech.org/programmer/encoding.html

“ECE-700 Filterbank Notes”, Why Filterbanks? Sub-band Processing, Phil Schniter, Ohio State Univ., March 10, 2008:
http://www.ece.osu.edu/~schniter/ee700/handouts/filterbanks.pdf


1 Polyphase Filter Bank

References

1. Phil Schniter, ECE-700 Filterbank Notes. http://www.ece.osu.edu/~schniter/ee700/handouts/filterbanks.pdf

2. Davis Yen Pan, Digital Audio Compression. http://www.digital-audio.net/res/docs/pdf/Digital_Audio_Compression_01oct1993DTJA03P8.pdf

3. Davis Pan, A Tutorial on MPEG/Audio Compression. http://www.cs.columbia.edu/~coms6181/slides/6R/mpegaud.pdf

4. CD 11172-3, Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio. http://le-hacker.org/hacks/mpeg-drafts/11172-3.pdf

5. Jong-Hwa Kim, Lossless Wideband Audio Compression: Prediction and Transform, Ph.D. Thesis. http://edocs.tu-berlin.de/diss/2003/kim_jonghwa.pdf

• In MPEG audio coding, a psychoacoustic model is used to decide how much quantization error can be tolerated in each sub-band, while signal components below the hearing threshold of a human listener are discarded.

• Sub-bands that can tolerate more error are coded with fewer bits. The quantized sub-band signals can then be decoded and recombined to reconstruct (an approximate version of) the input signal.

• Such processing allows, on average, a 12-to-1 reduction in bit rate while still maintaining CD-quality audio.

• The psychoacoustic model takes into account the spectral masking phenomenon of the human ear: high energy in one spectral region limits the ear's ability to hear details in nearby spectral regions. Therefore, when the energy in one sub-band is high, nearby sub-bands can be coded with fewer bits without degrading the perceived quality of the audio signal.

• The MPEG standard specifies 32 channels of sub-band filtering.
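As a rough worked example of the 12-to-1 figure, the sketch below assumes the usual CD parameters (44.1 kHz sampling, 16 bits per sample, two channels). These parameters are not stated in the notes and are an assumption of the example.

```python
# CD audio parameters (assumed for this example; not given in the notes).
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

def compressed_bitrate(ratio: float) -> float:
    """Bit rate in bit/s after reducing the CD bit rate by `ratio`."""
    cd_bitrate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
    return cd_bitrate / ratio

print(compressed_bitrate(1))    # uncompressed CD stream, bit/s
print(compressed_bitrate(12))   # after a 12-to-1 reduction, bit/s
```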

1.1 Uniform Modulated Filterbank

Polyphase Filterbank


• A modulated filterbank is composed of analysis branches which

1. modulate the input to center the desired sub-band at DC,

2. lowpass filter the modulated signal to isolate the desired sub-band, and

3. downsample the lowpass signal.

• The synthesis branches interpolate the sub-band signals by upsampling and lowpass filtering, then modulate each sub-band back to its original spectral location.

• In an $M$-branch critically-sampled uniformly-modulated filterbank, the $k$th analysis branch extracts the sub-band signal with center frequency $\omega_k = \frac{2\pi}{M}k$ via modulation and lowpass filtering with a (one-sided) bandwidth of $\frac{\pi}{M}$ radians, and then downsamples the result by a factor of $M$.

• The output of each branch of the uniform modulated filterbank is the time-domain signal of one sub-band.
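The three analysis steps can be sketched directly in a few lines. This is a minimal illustration, not the MPEG filter: the function name, the zero initial filter state and the 4-tap moving-average lowpass used in the usage lines are assumptions of the example, and the modulator sign follows Eq. (1) of these notes.

```python
import cmath

def analysis_branch(x, h, k, M):
    """Branch k of an M-branch uniform modulated analysis bank.

    Steps: modulate, lowpass filter with h, downsample by M.
    """
    # 1. Modulate: multiply x[n] by exp(j*2*pi*k*n/M).
    xm = [x[n] * cmath.exp(2j * cmath.pi * k * n / M) for n in range(len(x))]
    # 2. Lowpass filter (causal FIR, zero initial state).
    v = [sum(h[i] * xm[n - i] for i in range(len(h)) if n - i >= 0)
         for n in range(len(x))]
    # 3. Downsample: keep every M-th sample.
    return v[::M]

# Usage: a constant input lands in branch 0 and is rejected by branch 1.
h_toy = [0.25] * 4            # toy lowpass (moving average), an assumption
x_dc = [1.0] * 16
print(analysis_branch(x_dc, h_toy, 0, 4))
print(analysis_branch(x_dc, h_toy, 1, 4))
```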

1.2 Polyphase/DFT Implementation of Uniform Modulated Filterbank

The uniform modulated filterbank can be implemented using polyphase filterbanks and DFTs, resulting in huge computational savings. The figure illustrates the equivalent polyphase/DFT structures for analysis and synthesis.

• The impulse responses of the polyphase filters $P_\ell(z)$ and $\bar{P}_\ell(z)$ can be defined in the time domain as $p_\ell[m] = \bar{p}_\ell[m] = h[mM + \ell]$, where $h[n]$ denotes the impulse response of the lowpass filter.

• Recall that the standard implementation performs modulation, filtering, and downsampling, in that order.

• The polyphase/DFT implementation reverses the order of these operations; it performs downsampling, then filtering, then modulation (if we interpret the DFT as a two-dimensional bank of “modulators”).

We derive the polyphase/DFT implementation by exchanging the

order of modulation, filtering, and downsampling.


Reversing the modulation and filtering

We start by analyzing the $k$th filterbank branch. The first step is to reverse the modulation and filtering operations. To do this, we define a “modulated filter” $H_k(z)$:

$$v_k[n] = \sum_i h[i]\, x[n-i]\, e^{j\frac{2\pi}{M}k(n-i)} \qquad (1)$$

$$\phantom{v_k[n]} = \left( \sum_i h[i]\, e^{-j\frac{2\pi}{M}ki}\, x[n-i] \right) e^{j\frac{2\pi}{M}kn} \qquad (2)$$

$$\phantom{v_k[n]} = \left( \sum_i h_k[i]\, x[n-i] \right) e^{j\frac{2\pi}{M}kn} \qquad (3)$$

• where $h_k[i] = h[i]\, e^{-j\frac{2\pi}{M}ki}$ is the impulse response of the modulated filter. Equation (3) indicates that $x[n]$ is convolved with the modulated filter and that the filter output is then modulated.

• Now consider the downsampler. The only modulator outputs not discarded by the downsampler are those with time index $n = mM$. For those outputs, the modulator has the value $e^{j\frac{2\pi}{M}kmM} = 1$, and thus it can be ignored. The resulting system is shown in the bottom block diagram.
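The claim that the modulator can be dropped after downsampling is easy to check numerically. The sketch below evaluates Eq. (1) at the retained instants $n = mM$ and compares it with plain filtering by the modulated filter $h_k$. Function names, the zero initial state and the test data are assumptions of the example.

```python
import cmath

def branch_standard(x, h, k, M):
    """Modulate, filter with h, then keep samples n = mM (Eq. 1)."""
    w = 2 * cmath.pi * k / M
    return [sum(h[i] * x[n - i] * cmath.exp(1j * w * (n - i))
                for i in range(len(h)) if n - i >= 0)
            for n in range(0, len(x), M)]

def branch_modulated_filter(x, h, k, M):
    """Filter with h_k[i] = h[i] exp(-j 2 pi k i / M), then downsample.

    No modulator follows the filter: at n = mM it equals 1.
    """
    w = 2 * cmath.pi * k / M
    hk = [h[i] * cmath.exp(-1j * w * i) for i in range(len(h))]
    return [sum(hk[i] * x[n - i] for i in range(len(hk)) if n - i >= 0)
            for n in range(0, len(x), M)]

# Arbitrary test data (hypothetical values).
x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0, 1.0, -2.0, 0.5, 0.0, 1.25]
h = [0.2, 0.5, 0.3, -0.1, 0.05]
for k in range(4):
    a = branch_standard(x, h, k, 4)
    b = branch_modulated_filter(x, h, k, 4)
    print(k, max(abs(u - v) for u, v in zip(a, b)))  # differences near zero
```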

Reversing the order of filtering and downsampling

To apply the Noble identity, we must decompose $H_k(z)$ into a bank of upsampled polyphase filters. The polyphase decomposition is derived as follows:

$$H_k(z) = \sum_{n=-\infty}^{\infty} h_k[n]\, z^{-n} = \sum_{\ell=0}^{M-1} \sum_{m=-\infty}^{\infty} h_k[mM+\ell]\, z^{-mM-\ell}$$

Note that the $\ell$th polyphase component of $h_k$ has impulse response

$$h_k[mM+\ell] = h[mM+\ell]\, e^{-j\frac{2\pi}{M}k(mM+\ell)} = h[mM+\ell]\, e^{-j\frac{2\pi}{M}k\ell} = p_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell},$$

where $p_\ell[m]$ is the $\ell$th polyphase filter, obtained from the original (unmodulated) lowpass filter $H(z)$ by $M{:}1$ downsampling.

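We now obtain

$$H_k(z) = \sum_{\ell=0}^{M-1} \sum_{m=-\infty}^{\infty} p_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell}\, z^{-mM-\ell}$$

$$\phantom{H_k(z)} = \sum_{\ell=0}^{M-1} e^{-j\frac{2\pi}{M}k\ell}\, z^{-\ell} \sum_{m=-\infty}^{\infty} p_\ell[m]\, (z^M)^{-m}$$

$$\phantom{H_k(z)} = \sum_{\ell=0}^{M-1} e^{-j\frac{2\pi}{M}k\ell}\, z^{-\ell}\, P_\ell(z^M). \qquad (4)$$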
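Equation (4) can be sanity-checked by evaluating both sides at an arbitrary point $z$ for a short FIR filter. In the sketch below the function names, the filter taps and the test point are hypothetical, chosen only for illustration.

```python
import cmath

def H_k_direct(h, k, M, z):
    """H_k(z) = sum_n h[n] exp(-j 2 pi k n / M) z^(-n)."""
    return sum(h[n] * cmath.exp(-2j * cmath.pi * k * n / M) * z ** (-n)
               for n in range(len(h)))

def H_k_polyphase(h, k, M, z):
    """Right-hand side of Eq. (4), with p_l[m] = h[mM + l]."""
    total = 0
    for l in range(M):
        p_l = h[l::M]                                    # l-th polyphase filter
        P_l = sum(p * (z ** M) ** (-m) for m, p in enumerate(p_l))
        total += cmath.exp(-2j * cmath.pi * k * l / M) * z ** (-l) * P_l
    return total

h = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.0, 0.9, 0.7, 0.5, 0.3, 0.1]
z0 = 0.9 * cmath.exp(0.3j)                               # arbitrary test point
for k in range(4):
    print(k, abs(H_k_direct(h, k, 4, z0) - H_k_polyphase(h, k, 4, z0)))
```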

Derived filterbank structure - downsampler after the polyphase branches

Derived filterbank structure - downsampler before the polyphase branches

• The $k$th filterbank branch (now containing $M$ polyphase branches) is illustrated. Because it is a linear operator, the downsampler can be moved through the adders and the (time-invariant) scalings $e^{-j\frac{2\pi}{M}k\ell}$. Finally, the Noble identity is employed to exchange the filtering and downsampling.

• Observe that the polyphase outputs $\{v_\ell[m],\ \ell = 0, \dots, M-1\}$ are identical for each filterbank branch, while the scalings $\{e^{-j\frac{2\pi}{M}k\ell},\ \ell = 0, \dots, M-1\}$ are different for each filterbank branch since they depend on the filterbank branch index $k$.

• Thus, we only need to calculate the polyphase outputs $\{v_\ell[m],\ \ell = 0, \dots, M-1\}$ once. Using these outputs we can compute the branch outputs via

$$y_k[m] = \sum_{\ell=0}^{M-1} v_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell} \qquad (5)$$

• From Equation (5) it is clear that $y_k[m]$ corresponds to the $k$th DFT output given the $M$-point input sequence $\{v_\ell[m],\ \ell = 0, \dots, M-1\}$. Thus the $M$ filterbank branches can be computed in parallel by taking an $M$-point DFT of the $M$ polyphase outputs as shown.

Derived filterbank structure that incorporates the DFT block
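A small numerical sketch of the polyphase/DFT analysis bank follows. It computes the $M$ polyphase outputs once per output instant, applies a plain DFT as in Eq. (5), and compares the result with direct filtering by $h_k$. The helper names, the zero-padding outside the record and the test data are assumptions of the example.

```python
import cmath

def polyphase_outputs(x, h, M, m):
    """v_l[m] = sum_q p_l[q] x[(m - q) M - l], with p_l[q] = h[qM + l]."""
    v = []
    for l in range(M):
        acc = 0.0
        for q, p in enumerate(h[l::M]):
            n = (m - q) * M - l
            if 0 <= n < len(x):              # samples outside x count as zero
                acc += p * x[n]
        v.append(acc)
    return v

def dft(v):
    """Plain M-point DFT, Eq. (5): y_k = sum_l v_l exp(-j 2 pi k l / M)."""
    M = len(v)
    return [sum(v[l] * cmath.exp(-2j * cmath.pi * k * l / M) for l in range(M))
            for k in range(M)]

def direct_branch_output(x, h, k, M, m):
    """Reference: filter x with h_k and read the sample at n = mM."""
    n = m * M
    return sum(h[i] * cmath.exp(-2j * cmath.pi * k * i / M) * x[n - i]
               for i in range(len(h)) if 0 <= n - i < len(x))

x = [((7 * n) % 11) - 5.0 for n in range(24)]    # arbitrary test signal
h = [0.1, 0.2, 0.4, 0.6, 0.6, 0.4, 0.2, 0.1]     # arbitrary 8-tap filter
y = dft(polyphase_outputs(x, h, 4, 3))           # all 4 branches at m = 3
print([abs(y[k] - direct_branch_output(x, h, k, 4, 3)) for k in range(4)])
```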

1.3 Computational Savings of the Polyphase/DFT Modulated Filterbank Implementation

Here we consider the analysis bank only; the synthesis bank can be treated similarly.

standard structure Assume that the lowpass filter $H(z)$ has impulse response length $N$. To calculate the sub-band output vector $\{y_k[m],\ k = 0, \dots, M-1\}$ using the standard structure, we have

1. $N$ multiplications for the filter $H(z)$ plus one multiplication for the modulator, per filter output,

2. $M$ filter outputs to compute for each retained sample $y_k[m]$, because the downsampler discards the other $M-1$,

3. $M$ branches of the filterbank.

Thus, the total number of multiplications is $M^2(N + 1)$.

lowpass/downsampler If we implement the lowpass/downsampler in each filterbank branch with a polyphase decimator and compute the modulation with a DFT, the number of multiplications is

1. $N/M$ multiplications for each polyphase filter $P_\ell(z)$, which has length $N/M$, for each of the $M$ polyphase branches, i.e. $N$ in total,

2. $M \times M$ multiplications for the $M$-point DFT.

Thus, the total is $N + M^2$.

FFT If a radix-2 FFT algorithm is used to implement the DFT, we have approximately

1. $\frac{M}{2}\log_2 M$ multiplications for the radix-2 FFT,

2. $N$ multiplications for the $M$ polyphase filters, as before.

Thus, the total number of multiplications is $N + \frac{M}{2}\log_2 M$.

When $M = 32$ and $N/M = 10$ (i.e. $N = 320$), the standard filterbank structure requires 328704 multiplications, the polyphase/DFT structure requires 1344 multiplications, and the polyphase/FFT implementation requires only 400 multiplications.
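The operation counts can be tabulated with a short helper. The sketch takes $N$ to be the full length of $H(z)$, so that each polyphase filter has $N/M$ taps, and assumes $M$ is a power of two for the radix-2 FFT estimate. The function name and dictionary keys are illustrative only.

```python
from math import log2

def multiplications(M: int, N: int) -> dict:
    """Multiplications per output vector; N is the length of H(z)."""
    return {
        "standard": M * M * (N + 1),                     # M^2 (N + 1)
        "polyphase_dft": N + M * M,                      # N + M^2
        "polyphase_fft": N + (M // 2) * int(log2(M)),    # N + (M/2) log2 M
    }

print(multiplications(32, 320))   # M = 32 branches, 10 taps per polyphase filter
```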

2 The Analysis Subband Filter used by MPEG-1 Layer-I and II

In the MPEG-1 audio encoder, there are two main processing branches in the block diagram. One branch is the analyzer of psychoacoustic effects; the other is the subband analysis filter bank, which produces the output of each subband (critical band), frequency-shifted to the baseband. The detailed processing steps of the subband analysis filter bank are shown in the figure below. The corresponding code from the MATLAB program Matlab_MPEG_1_2_4.zip is listed in the following: a few lines from the main program and all of the subroutine Analysis_subband_filter.m are shown.

Block diagram of MPEG1 Layer-II


In the flow diagram shown in Fig. 2, the first block shows that a block of 512 data points is taken into a FIFO (First In First Out) buffer. The data in the FIFO are processed by a polyphase filterbank. This FIFO buffer is updated every time the subband analysis is completed by shifting in a set of 32 new samples, as illustrated in the figure “Input data for the subband filterbank”.

The second block of Fig. 2 applies the low-pass filter function shown in the figure “Window function applied to a frame of 512 point data” to a frame of 512 samples before it is sent to the subband analysis by the polyphase filter bank. This low-pass filter is a band-limiting filter that suppresses frequency aliasing. The pass band within a subband (cut-off frequency) is set to $\frac{f_s}{64} \times 0.5824$. This filter function can be designed by the window method of FIR filter design, briefly explained in a section to follow. The designed filter function is then multiplied by the Blackman window (not the Hanning window). The total length of 512 samples is then divided into 8 segments of 64 samples. The alternating signs $\{-,+,-,+,-,+,-,+\}$ are attached to the segments. This shifts the pass band to the center of a subband.

To understand the details of this processing, we will review the concepts of the polyphase filter bank and the DCT in the following sections.

Flow Diagram of the MPEG-1 Audio Encoder Layer-I and Layer II

Input data for the subband filterbank

Window function applied to a frame of 512 point data

% Load tables.
[TH, Map, LTq] = Table_absolute_threshold(1, fs, 128); % Threshold in quiet
CB = Table_critical_band_boundaries(1, fs);
C = Table_analysis_window;

% Analysis subband filtering [1, pp. 67].
% Each call returns one row of 32 subband samples; 12 rows are stacked in S.
for i = 0:11,
    S = [S; Analysis_subband_filter(x, OFFSET + 32 * i, C)];
end

% -----------------------------------------------
function S = Analysis_subband_filter(Input, n, C)
Common;
nmax = length(Input);

% Check input parameters
if (n + 31 > nmax || n < 1)
    error('Unexpected analysis index.');
end

% Build an input vector X of 512 elements. The most recent sample
% is at position 512 while the oldest element is at position 1.
% Pad with zeros if the input signal does not exist.
% ...........................................................
% |        480 samples        |   32 samples   |
% n-480                       n              n+31
X = Input(max(1, n - 480):n + 31); % / 32768
X = X(:);
X = [zeros(512 - length(X), 1); X];

% Window vector X by vector C. This produces the Z buffer.
Z = X .* C;

% Partial calculation: 64 Yi coefficients
Y = zeros(1, 64);
for i = 1 : 64,
    for j = 0 : 7,
        Y(i) = Y(i) + Z(i + 64 * j);
    end
end

% Calculate the analysis filter bank coefficients
for i = 0 : 31,
    for k = 0 : 63,
        M(i + 1, k + 1) = cos((2 * i + 1) * (k - 16) * pi / 64);
    end
end

% Calculate the 32 subband samples Si
S = zeros(1, 32);
for i = 1 : 32,
    for k = 1 : 64,
        S(i) = S(i) + M(i, k) * Y(k);
    end
end
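For readers without MATLAB, the core of the listing (windowing, 64 partial sums, cosine matrixing) can be sketched in Python. The all-ones window in the usage lines is a placeholder, since the real analysis window table is not reproduced in these notes. The function names are hypothetical. `direct_form` is an equivalent single sum added here only as a cross-check: the cosine changes sign every 64 samples, which may be how the alternating segment signs mentioned above arise.

```python
import math

def analysis_subband(X, C):
    """32 subband samples from a 512-sample buffer X (X[0] is the oldest)."""
    Z = [x * c for x, c in zip(X, C)]                               # windowing
    Y = [sum(Z[k + 64 * j] for j in range(8)) for k in range(64)]   # partial sums
    return [sum(math.cos((2 * i + 1) * (k - 16) * math.pi / 64) * Y[k]
                for k in range(64))
            for i in range(32)]                                     # matrixing

def direct_form(X, C):
    """Same result as one 512-term sum with a sign flip per 64-sample segment."""
    return [sum((-1) ** (n // 64)
                * math.cos((2 * i + 1) * (n - 16) * math.pi / 64)
                * X[n] * C[n] for n in range(512))
            for i in range(32)]

X = [math.sin(0.05 * n) for n in range(512)]   # arbitrary test buffer
C = [1.0] * 512                                # placeholder window (assumption)
S = analysis_subband(X, C)
print(len(S))
```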

3 Application of Psychoacoustic Principles: ISO 11172-3 (MPEG-1)

PSYCHOACOUSTIC MODEL 1

• It is useful to consider an example of how the psychoacoustic principles described thus far are applied in actual coding algorithms. The ISO/IEC 11172-3 (MPEG-1, layer 1) psychoacoustic model 1 determines the maximum allowable quantization noise energy in each critical band such that the quantization noise remains inaudible.

• In one of its modes, the model uses a 512-point DFT for high-resolution spectral analysis (86.13 Hz bin spacing at a 44.1 kHz sampling rate), then estimates for each input frame the individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the signal spectrum. A global masking threshold is then estimated for a subset of the original 256 frequency bins by (power) additive combination of the tonal and non-tonal individual masking thresholds.

• This section describes the step-by-step model operations. The five steps leading to computation of global masking thresholds are as follows:

1. Spectral Analysis and SPL (Sound Pressure Level) Normalization

2. Identification of Tonal and Noise Maskers

3. Decimation and Reorganization of Maskers

4. Calculation of Individual Masking Thresholds

5. Calculation of Global Masking Thresholds

3.1 Spectral Analysis and SPL Normalization

First, the incoming audio samples $s(n)$, signed integers of $b$ bits, are normalized according to the FFT length, $N$, and the number of bits per sample, $b$, using the relation

$$x(n) = \frac{s(n)}{N\, 2^{b-1}}$$

Normalization references the power spectrum to a 0-dB maximum. The normalized input, $x(n)$, is then segmented into 12 ms frames (512 samples) using a 1/16th-overlapped Hann window such that each frame contains 10.9 ms of new data. A power spectral density (PSD) estimate, $P(k)$, is then obtained using a 512-point FFT. The DFT

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi nk}{N}}$$

is applied to the windowed frame, i.e.

$$X(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j\frac{2\pi nk}{N}}.$$

The Hanning window (Hann window), defined by

$$w(n) = \frac{1}{2}\left[ 1 - \cos\left( \frac{2\pi n}{N} \right) \right],$$

is used to reduce the spectral leakage from other frequencies into the frequency being analysed.
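The normalization, the Hann window and the framing can be sketched as below. The hop of 480 samples is inferred here from the 1/16 overlap of a 512-sample frame and is an assumption of the example, as are the function names and default arguments.

```python
import math

def normalize(s, N=512, b=16):
    """x(n) = s(n) / (N * 2**(b-1)) for signed b-bit integer samples."""
    scale = N * 2 ** (b - 1)
    return [v / scale for v in s]

def hann(N=512):
    """w(n) = 0.5 * (1 - cos(2*pi*n/N)), n = 0..N-1."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def frames(x, N=512, hop=480):
    """Split x into N-sample frames that advance by `hop` samples."""
    return [x[i:i + N] for i in range(0, len(x) - N + 1, hop)]

w = hann()
blocks = frames(normalize(list(range(1500))))
windowed = [[v * wn for v, wn in zip(blk, w)] for blk in blocks]
print(len(windowed), len(windowed[0]))
```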

Spectrum of the Rectangular (time) Window

Spectrum of the Hanning Window

A power spectral density (PSD) estimate, $P(k)$, is then obtained from $X(k)$ computed by a 512-point FFT (Fast Fourier Transform), a fast algorithm for computing the DFT (Discrete Fourier Transform). The PSD resulting from a 512-point FFT has 256 spectral components (harmonics).

$$P(k) = PN + 10 \log_{10} |X(k)|^2 \quad \text{for } 0 \le k \le \frac{N}{2}$$

where the power normalization term, $PN$, is the reference sound pressure level of 96 dB.
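A direct (slow) evaluation of $P(k)$ makes the scaling concrete. The sketch below uses a plain DFT instead of an FFT and a short frame to keep it quick. The function name, the floor of 1e-20 that avoids log(0), and the test tone are assumptions of the example. With the normalization and Hann window above, a full-scale sinusoid centered on a bin should come out about 12 dB below $PN$, i.e. near 84 dB.

```python
import cmath
import math

def psd_db(s, b=16, PN=96.0):
    """P(k) = PN + 10 log10 |X(k)|^2 for k = 0..N/2 (Hann-windowed DFT)."""
    N = len(s)
    x = [v / (N * 2 ** (b - 1)) for v in s]                    # SPL normalization
    w = [0.5 * (1 - math.cos(2 * math.pi * n / N)) for n in range(N)]
    P = []
    for k in range(N // 2 + 1):
        X = sum(x[n] * w[n] * cmath.exp(-2j * math.pi * n * k / N)
                for n in range(N))
        P.append(PN + 10 * math.log10(max(abs(X) ** 2, 1e-20)))  # floor avoids log(0)
    return P

# Full-scale sinusoid exactly on bin 8 of a 64-sample frame.
N = 64
s = [32768 * math.cos(2 * math.pi * 8 * n / N) for n in range(N)]
P = psd_db(s)
print(round(P[8], 2))
```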

Problem

Matlab_MPEG_1_2_4.zip contains a MATLAB program that simulates all of the MP3 psychoacoustic masking threshold calculations. The subroutine FFT_Analysis.m calculates the Power Spectral Density (PSD). The main program is Test_MPEG.m. Apply this program to a music piece in *.wav format of your choice to see its PSD. Slide the time window of 512 samples to find the first block for which no zero padding is applied in the analysis. The PSD of “Eine Kleine Nachtmusik” by Mozart is shown below. The key part of the processing in FFT_Analysis.m is shown below.

% Compute the auditory spectrum using the Fast Fourier Transform.
% The spectrum X is expressed in dB. The size of the transform is 512 and
% it is centered on the 384 samples (12 samples per subband) used for the
% subband analysis. The first of the 384 samples is indexed by n:
% ................................................
% |       |        384 samples        |       |
% n-64    n                         n+383   n+447
% A Hanning window is applied before computing the FFT.
%
% Prepare the Hanning window
h = sqrt(8/3) * hanning(FFT_SIZE);

% Power density spectrum
X = max(20 * log10(abs(fft(s .* h)) / FFT_SIZE), MIN_POWER);

% Normalization to the reference sound pressure level of 96 dB
Delta = 96 - max(X);
X = X + Delta;

PSD of “Eine Kleine Nachtmusik” by Mozart


3.2 Identification of Tonal and Noise Maskers

After PSD estimation and SPL normalization, tonal and non-tonal masking components are identified.

Tonal maskers

Local maxima in the sample PSD which exceed neighboring components within a certain Bark distance by at least 7 dB are classified as tonal. Specifically, the tonal set, $S_T$, is defined as

$$S_T = \left\{ P(k) \;\middle|\; P(k) > P(k \pm 1),\ P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \right\}$$

where

$$\Delta_k \in \begin{cases} 2 & 2 < k < 63 \quad (0.17\text{-}5.5\ \text{kHz}) \\ (2, 3) & 63 \le k < 127 \quad (5.5\text{-}11\ \text{kHz}) \\ (2, \dots, 6) & 127 \le k < 256 \quad (11\text{-}20\ \text{kHz}) \end{cases}$$

Tonal maskers, $P_{TM}(k)$, are computed from the spectral peaks listed in $S_T$ as follows:

$$P_{TM}(k) = 10 \log_{10} \sum_{j=-1}^{+1} 10^{0.1 P(k+j)}\ \text{dB}$$
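The tonal test can be sketched as a scan over the PSD. The function names, the handling of bins near the ends of the spectrum and the synthetic PSD in the usage lines are assumptions of the example. The MATLAB routine in the course package may organize this step differently.

```python
import math

def delta_k(k):
    """Neighbourhood offsets for bin k, following the ranges given above."""
    if 2 < k < 63:
        return [2]
    if 63 <= k < 127:
        return [2, 3]
    if 127 <= k < 256:
        return [2, 3, 4, 5, 6]
    return []

def tonal_maskers(P):
    """Return {k: P_TM(k)} for a PSD list P given in dB."""
    out = {}
    for k in range(3, len(P) - 1):
        dks = delta_k(k)
        if not dks or k + max(dks) >= len(P):
            continue                                   # neighbourhood not available
        if not (P[k] > P[k - 1] and P[k] > P[k + 1]):
            continue                                   # not a local maximum
        if all(P[k] > P[k - d] + 7 and P[k] > P[k + d] + 7 for d in dks):
            # Combine the peak with its two immediate neighbours in power.
            out[k] = 10 * math.log10(sum(10 ** (0.1 * P[k + j]) for j in (-1, 0, 1)))
    return out

P = [0.0] * 257
P[39], P[40], P[41] = 50.0, 60.0, 50.0    # isolated peak: tonal
P[200], P[205] = 50.0, 45.0               # peak with a strong neighbour: not tonal
print(tonal_maskers(P))
```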

Noise maskers

A single noise masker for each critical band, $P_{NM}(\bar{k})$, is then computed from the (remaining) spectral lines not within the $\pm\Delta_k$ neighborhood of a tonal masker using the sum

$$P_{NM}(\bar{k}) = 10 \log_{10} \sum_{j} 10^{0.1 P(j)}\ \text{dB} \quad \text{for all } P(j) \notin \{ P_{TM}(k, k \pm 1, k \pm \Delta_k) \}$$

where

$$\bar{k} = \left( \prod_{j=l}^{u} j \right)^{\frac{1}{u-l+1}}$$

is the geometric mean spectral line of the critical band, and $l$ and $u$ are the lower and upper spectral line boundaries of the critical band, respectively.
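A minimal sketch of the noise-masker computation for one critical band is given below. The function names and the `excluded` argument (the set of bins already assigned to tonal maskers) are hypothetical, and the band boundaries in the usage lines are made up for illustration.

```python
import math

def geometric_mean_bin(l, u):
    """k_bar = (prod_{j=l..u} j) ** (1/(u-l+1)), computed with logarithms."""
    return math.exp(sum(math.log(j) for j in range(l, u + 1)) / (u - l + 1))

def noise_masker(P, l, u, excluded):
    """Power-sum the PSD lines l..u (in dB) that are not in `excluded`.

    Returns (k_bar, P_NM), or None if no spectral line remains in the band.
    """
    lines = [j for j in range(l, u + 1) if j not in excluded]
    if not lines:
        return None
    power = sum(10 ** (0.1 * P[j]) for j in lines)
    return geometric_mean_bin(l, u), 10 * math.log10(power)

P = [10.0] * 20                        # flat 10 dB spectrum (toy data)
print(noise_masker(P, 10, 13, {11}))   # three lines of 10 dB remain
```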

(1) local maxima

(2) tonal components

(3) tonal and non-tonal components of Eine Kleine Nachtmusik

Problem

The subroutine Find_tonal_components.m contained in the MP3 psychoacoustic masking simulation program Matlab_MPEG_1_2_4.zip first calculates the local maxima of the Power Spectral Density (PSD). From the obtained local maxima of the PSD, tonal components are calculated based on the equations described above. Then, non-tonal components and the frequencies of the critical bands are calculated. The main program is Test_MPEG.m. Apply this program to the music piece in *.wav format chosen in the previous Problem to show the 3 figures generated by Find_tonal_components.m: (1) local maxima, (2) tonal components, and (3) tonal and non-tonal components.

3.3 Decimation and Reorganization of Maskers

In this step, the number of maskers is reduced using two criteria. First, any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers which satisfy

$$P_{TM,NM}(k) \ge T_q(k)$$

are retained, where $T_q(k)$ is the SPL of the threshold in quiet at spectral line $k$. Next, a sliding 0.5 Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two.

After the sliding window procedure, masker frequency bins are reorganized according to the subsampling scheme

$$P_{TM,NM}(i) = \begin{cases} P_{TM,NM}(k) & \text{if } i = k \\ 0 & \text{if } i \ne k \end{cases}$$

The net effect is 2:1 decimation of masker bins in critical bands 18-22 and 4:1 decimation of masker bins in critical bands 22-25, with no loss of masking components. This procedure reduces the total number of tone and noise masker frequency bins under consideration from 256 to 106. An example of the decimation, for maskers of equal SPL, is shown in the table below.

k     i     decimate
50    50    keep
51    52    zero
52    52    keep
100   100   keep
101   104   zero
102   104   zero
103   104   zero
104   104   keep
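The table can be reproduced with an explicit index map. The three ranges and formulas below are recalled from the Painter and Spanias review cited at the start of these notes and are not spelled out on the slide, so treat them as an assumption. They do reproduce every row of the table and the stated reduction to 106 bins.

```python
def decimated_index(k):
    """Target bin i for masker bin k (1-based spectral line index)."""
    if 1 <= k <= 48:
        return k                          # no decimation
    if 49 <= k <= 96:
        return k + (k % 2)                # 2:1, odd bins map to the next even bin
    if 97 <= k <= 232:
        return k + 3 - ((k - 1) % 4)      # 4:1, bins map to the next multiple of 4
    raise ValueError("bin outside the range covered by the map")

for k in (50, 51, 52, 100, 101, 102, 103, 104):
    i = decimated_index(k)
    print(k, i, "keep" if i == k else "zero")
```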

Problem

The subroutine Decimation.m contained in the MP3 psychoacoustic masking simulation program Matlab_MPEG_1_2_4.zip performs all of the decimation processes described in this sub-section. Apply this program to the music piece in *.wav format chosen in the previous Problem to see whether any SPLs are eliminated because (1) tonal or noise maskers are below the absolute threshold, (2) a pair of maskers occurring within a distance of 0.5 Bark is replaced by the stronger of the two, or (3) 2:1 decimation of masker bins in critical bands 18-22 and 4:1 decimation of masker bins in critical bands 22-25 is applied.

Tonal and non-tonal maskers after decimation. Only one non-tonal masker SPL under the absolute threshold was eliminated.

3.4 Calculation of Individual Masking Thresholds

Having obtained a decimated set of tonal and noise maskers, individual tone and noise masking thresholds are computed next. Each individual threshold represents a masking contribution at frequency bin $i$ due to the tone or noise masker located at bin $j$ (reorganized during step 3). Tonal masker thresholds, $T_{TM}(i, j)$, are given by

$$T_{TM}(i, j) = P_{TM}(j) - 0.275\, z(j) + SF(i, j) - 6.025\ \text{dB}$$

where $P_{TM}(j)$ denotes the SPL of the tonal masker in frequency bin $j$, $z(j)$ denotes the Bark frequency of bin $j$, and the spread of masking from masker bin $j$ to maskee bin $i$, $SF(i, j)$, is modeled by the expression

$$SF(i, j) = \begin{cases} 17\Delta_z - 0.4 P_{TM}(j) + 11, & -3 \le \Delta_z < -1 \\ (0.4 P_{TM}(j) + 6)\Delta_z, & -1 \le \Delta_z < 0 \\ -17\Delta_z, & 0 \le \Delta_z < 1 \\ (0.15 P_{TM}(j) - 17)\Delta_z - 0.15 P_{TM}(j), & 1 \le \Delta_z < 8 \end{cases}\ \text{dB}$$
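The piecewise spreading function translates directly into code. Returning minus infinity outside the modelled range (so that the masker contributes no power there) is a choice made for this sketch, not something stated in the notes. The function name is also hypothetical.

```python
def spreading_function(dz, P):
    """SF in dB for Bark separation dz = z(i) - z(j) and masker level P in dB."""
    if -3 <= dz < -1:
        return 17 * dz - 0.4 * P + 11
    if -1 <= dz < 0:
        return (0.4 * P + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz < 8:
        return (0.15 * P - 17) * dz - 0.15 * P
    return float("-inf")    # outside the modelled range (assumption)

# The upper slope becomes shallower as the masker level rises.
for level in (20, 60, 100):
    print(level, spreading_function(2.0, level))
```

Evaluating the branches at their boundaries (dz = -1, 0, 1) gives matching values on both sides, so the function is continuous across the four segments.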

Prototype spreading functions at z=10 as a function of masker level


$SF(i, j)$ is a piecewise linear function of masker level, $P_{TM}(j)$, and Bark maskee-masker separation, $\Delta_z = z(i) - z(j)$. $SF(i, j)$ approximates the basilar spreading (excitation pattern). As shown in the figure, the slope of $T_{TM}(i, j)$ decreases with increasing masker level. This reflects psychophysical test results, which have demonstrated that the ear's frequency selectivity decreases as stimulus levels increase. Note also that the spread of masking in this particular model is constrained to a 10-Bark neighborhood for computational efficiency. This simplifying assumption is reasonable given the very low masking levels which occur in the tails of the basilar excitation patterns modeled by $SF(i, j)$.

Individual noise masker thresholds, $T_{NM}(i, j)$, are given by

$$T_{NM}(i, j) = P_{NM}(j) - 0.175\, z(j) + SF(i, j) - 2.025\ \text{dB}$$

where $P_{NM}(j)$ denotes the SPL of the noise masker in frequency bin $j$, $z(j)$ denotes the Bark frequency of bin $j$, and $SF(i, j)$ is obtained by replacing $P_{TM}(j)$ with $P_{NM}(j)$.
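The two threshold formulas differ only in their constants, which the sketch below makes explicit. The spreading value `sf` is passed in as a number. The `bark` helper uses a common arctangent approximation of the Hz-to-Bark mapping, which is not given in these notes and is included here as an assumption, as are the function names.

```python
import math

def bark(f_hz):
    """Approximate Bark frequency of f_hz (arctangent approximation, assumed)."""
    return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def tonal_threshold(P_tm, z_j, sf):
    """T_TM(i,j) = P_TM(j) - 0.275 z(j) + SF(i,j) - 6.025 dB."""
    return P_tm - 0.275 * z_j + sf - 6.025

def noise_threshold(P_nm, z_j, sf):
    """T_NM(i,j) = P_NM(j) - 0.175 z(j) + SF(i,j) - 2.025 dB."""
    return P_nm - 0.175 * z_j + sf - 2.025

# Same 60 dB masker at 10 Bark, evaluated at the masker's own bin (SF = 0).
print(tonal_threshold(60, 10, 0), noise_threshold(60, 10, 0))
```

For equal levels, the noise masker yields the higher threshold.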

Problem

The subroutine Individual_masking_thresholds.m contained in the MP3 psychoacoustic masking simulation program Matlab_MPEG_1_2_4.zip calculates the individual masking thresholds of tonal maskers, $T_{TM}(i, j)$, and non-tonal maskers, $T_{NM}(i, j)$, using the spreading function $SF(i, j)$. Apply this program to the music piece in *.wav format chosen in the previous Problem to plot the individual masking thresholds of a frame.

3.5 Calculation of Global Masking Thresholds

In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin in the decimated subset defined in Section 3.3. The model assumes that masking effects are additive. The global masking threshold, $T_g(i)$, is therefore obtained by computing the sum

$$T_g(i) = 10 \log_{10} \left( 10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i,m)} \right)\ \text{dB}$$

where $T_q(i)$ is the absolute hearing threshold for frequency bin $i$, $T_{TM}(i, l)$ and $T_{NM}(i, m)$ are the individual masking thresholds, and $L$ and $M$ are the number of tonal and noise maskers, respectively, identified previously.
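The power-additive combination is a one-liner once the individual thresholds for bin $i$ are collected in lists. A minimal sketch, with a hypothetical function name and argument layout:

```python
import math

def global_threshold(Tq_i, T_tm, T_nm):
    """T_g(i) in dB from the threshold in quiet and the individual thresholds.

    T_tm and T_nm are lists of T_TM(i, l) and T_NM(i, m) values in dB.
    """
    power = 10 ** (0.1 * Tq_i)
    power += sum(10 ** (0.1 * t) for t in T_tm)
    power += sum(10 ** (0.1 * t) for t in T_nm)
    return 10 * math.log10(power)

# Two equal 40 dB contributions add in power, giving about 3 dB more.
print(global_threshold(40.0, [40.0], []))
```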

In other words, the global threshold for each frequency bin represents a signal-dependent, power-additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum. The next figure shows the global masking threshold obtained by adding the power of the individual tonal and noise maskers to the absolute threshold in quiet.

Individual masking thresholds for both tonal and non-tonal maskers. The global masking threshold is the sum of all individual masking thresholds.
