Multimedia Signals and Systems
MP3 - MPEG 1,2 Layer 1,2,3
Polyphase Filterbank
Kunio Takaya
Electrical and Computer Engineering
University of Saskatchewan
March 31, 2008
“A review of algorithms for perceptual coding of digital audio
signals”
Painter, T. Spanias, A.
Dept. of Electr. Eng., Arizona State Univ., Tempe, AZ;
http://ieeexplore.ieee.org/iel3/4961/13644/00628010.pdf?arnumber=628010
MP3’ Tech - Encoding engines source codes:
http://www.mp3-tech.org/programmer/encoding.html
“ECE-700 Filterbank Notes.”, Why Filterbanks? Sub-band
Processing:
Phil Schniter, Ohio State Univ. March 10, 2008.
http://www.ece.osu.edu/~schniter/ee700/handouts/filterbanks.pdf
1 Polyphase Filter Bank
References
1. Phil Schniter, ECE-700 Filterbank Notes
2. Davis Yen Pan, Digital Audio Compression
3. Davis Pan, A Tutorial on MPEG/Audio Compression
4. CD 11172-3 CODING OF MOVING PICTURES AND
ASSOCIATED AUDIO FOR DIGITAL STORAGE MEDIA AT
UP TO ABOUT 1.5 MBIT/s Part 3 AUDIO
5. Jong-Hwa Kim, Lossless Wideband Audio Compression:
Prediction and Transform, Ph.D. Thesis
• In MPEG audio coding, a psychoacoustic model is used to
decide how much quantization error can be tolerated in each
sub-band, while signal components below the hearing threshold
of a human listener are discarded.
• In the sub-bands that can tolerate more error, fewer bits are
used for coding. The quantized sub-band signals can then be
decoded and recombined to reconstruct (an approximate
version of) the input signal.
• Such processing allows, on average, a 12-to-1 reduction in bit
rate while still maintaining CD-quality audio.
• The psychoacoustic model takes into account the spectral
masking phenomenon of the human ear, which says that high
energy in one spectral region will limit the ear’s ability to hear
details in nearby spectral regions. Therefore, when the energy
in one sub-band is high, nearby sub-bands can be coded with
fewer bits without degrading the perceived quality of the audio
signal.
• The MPEG standard specifies 32 channels of sub-band
filtering.
1.1 Uniform Modulated Filterbank
Polyphase Filterbank
• A modulated filterbank is composed of analysis branches which
1. modulate the input to center the desired sub-band at DC,
2. lowpass filter the modulated signal to isolate the desired
sub-band, and
3. downsample the lowpass signal.
• The synthesis branches interpolate the sub-band signals by
upsampling and lowpass filtering, then modulate each
sub-band back to its original spectral location.
• In an M-branch critically-sampled uniformly-modulated
filterbank, the kth analysis branch extracts the sub-band signal
with center frequency $\omega_k = \frac{2\pi}{M}k$ via modulation and lowpass
filtering with a (one-sided) bandwidth of $\frac{\pi}{M}$ radians, and then
downsamples the result by a factor of M.
• The output from the uniform modulated filterbank is
time-domain data of a subband.
1.2 Polyphase/DFT Implementation of Uniform
Modulated Filterbank
Uniform Modulated Filterbank
The uniform modulated filterbank can be implemented using
polyphase filterbanks and DFTs, resulting in huge computational
savings. The figure illustrates the equivalent polyphase/DFT
structures for analysis and synthesis.
• The impulse responses of the polyphase filters $P_\ell(z)$ and
$\bar{P}_\ell(z)$ can be defined in the time domain as
$p_\ell[m] = \bar{p}_\ell[m] = h[mM + \ell]$, where $h[n]$ denotes the impulse
response of the lowpass filter.
• Recall that the standard implementation performs modulation,
filtering, and downsampling, in that order.
• The polyphase/DFT implementation reverses the order of
these operations; it performs downsampling, then filtering,
then modulation (if we interpret the DFT as a two-dimensional
bank of “modulators”).
We derive the polyphase/DFT implementation by exchanging the
order of modulation, filtering, and downsampling.
Reversing the modulation and filtering
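We start by analyzing the kth filterbank branch, illustrated below.
The first step is to reverse the modulation and filtering operations.
To do this, we define a “modulated filter” $H_k(z)$:
$$
\begin{aligned}
v_k[n] &= \sum_i h[i]\, x[n-i]\, e^{j\frac{2\pi}{M}k(n-i)} && (1)\\
       &= \Big( \sum_i h[i]\, e^{-j\frac{2\pi}{M}ki}\, x[n-i] \Big)\, e^{j\frac{2\pi}{M}kn} && (2)\\
       &= \Big( \sum_i h_k[i]\, x[n-i] \Big)\, e^{j\frac{2\pi}{M}kn} && (3)
\end{aligned}
$$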
• Here $h_k[i] = h[i]\, e^{-j\frac{2\pi}{M}ki}$ is the impulse response of the
modulated filter. Equation (3) indicates that x[n] is
convolved with the modulated filter and that the filter output
is then modulated.
• Now consider the downsampler. The only modulator outputs
not discarded by the downsampler are those with time index
n = mM. For those outputs, the modulator has the value
$e^{j\frac{2\pi}{M}k\,mM} = 1$, and thus it can be ignored. The resulting system
is shown in the bottom block diagram.
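The two bullets above can be checked numerically. The following is a minimal Python sketch, assuming an arbitrary short prototype filter `h` and test input `x`; the function names are illustrative and do not come from the original notes. It evaluates Eq. (1) at the retained time indices n = mM and compares the result with filtering by the modulated filter alone, with the output modulator dropped.

```python
import cmath

def branch_direct(x, h, k, M, m):
    """Eq. (1) evaluated at n = mM: modulate, filter with h, keep sample mM."""
    n = m * M
    total = 0j
    for i, hi in enumerate(h):
        if 0 <= n - i < len(x):
            total += hi * x[n - i] * cmath.exp(2j * cmath.pi * k * (n - i) / M)
    return total

def branch_modulated_filter(x, h, k, M, m):
    """Filter with h_k[i] = h[i] exp(-j 2 pi k i / M); no output modulator."""
    n = m * M
    total = 0j
    for i, hi in enumerate(h):
        if 0 <= n - i < len(x):
            hk = hi * cmath.exp(-2j * cmath.pi * k * i / M)
            total += hk * x[n - i]
    return total

if __name__ == "__main__":
    M = 4
    h = [0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1, 0.05]   # assumed prototype filter
    x = [((7 * n) % 11) / 11.0 for n in range(40)]  # arbitrary test input
    for k in range(M):
        a = branch_direct(x, h, k, M, 5)
        b = branch_modulated_filter(x, h, k, M, 5)
        print(k, abs(a - b))  # differences at rounding-error level
```

The two functions agree only at indices n = mM; at other time indices the output modulator is not equal to 1 and cannot be dropped.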
Reversing the order of filtering and downsampling
To apply the Noble identity, we must decompose Hk(z) into a bank
of upsampled polyphase filters. The process to derive polyphase
decimation is explained here:
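$$
H_k(z) = \sum_{n=-\infty}^{\infty} h_k[n]\, z^{-n}
       = \sum_{\ell=0}^{M-1} \sum_{m=-\infty}^{\infty} h_k[mM+\ell]\, z^{-mM-\ell}
$$
Note that the ℓth polyphase filter has impulse response
$$
h_k[mM+\ell] = h[mM+\ell]\, e^{-j\frac{2\pi}{M}k(mM+\ell)}
             = h[mM+\ell]\, e^{-j\frac{2\pi}{M}k\ell}
             = p_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell},
$$
where the middle equality holds because $e^{-j2\pi km} = 1$, and $p_\ell[m]$ is the
ℓth polyphase filter obtained from the original (unmodulated) lowpass
filter H(z) by M:1 downsampling.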
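We now obtain
$$
\begin{aligned}
H_k(z) &= \sum_{\ell=0}^{M-1} \sum_{m=-\infty}^{\infty} p_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell}\, z^{-mM-\ell} \\
       &= \sum_{\ell=0}^{M-1} e^{-j\frac{2\pi}{M}k\ell}\, z^{-\ell} \sum_{m=-\infty}^{\infty} p_\ell[m]\, (z^M)^{-m} \\
       &= \sum_{\ell=0}^{M-1} e^{-j\frac{2\pi}{M}k\ell}\, z^{-\ell}\, P_\ell(z^M). && (4)
\end{aligned}
$$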
Derived filterbank structure - downsampler after the polyphase
branches
Derived filterbank structure - downsampler before the polyphase
branches
• The kth filterbank branch (now containing M polyphase
branches) is illustrated. Because it is a linear operator, the
downsampler can be moved through the adders and the
(time-invariant) scalings $e^{-j\frac{2\pi}{M}k\ell}$. Finally, the Noble identity is
employed to exchange the filtering and downsampling.
• Observe that the polyphase outputs $\{v_\ell[m],\ \ell = 0, \dots, M-1\}$
are identical for each filterbank branch, while the scalings
$\{e^{-j\frac{2\pi}{M}k\ell},\ \ell = 0, \dots, M-1\}$ are different for each filterbank
branch since they depend on the filterbank branch index k.
• Thus, we only need to calculate the polyphase outputs
$\{v_\ell[m],\ \ell = 0, \dots, M-1\}$ once. Using these outputs we can
compute the branch outputs via
$$
y_k[m] = \sum_{\ell=0}^{M-1} v_\ell[m]\, e^{-j\frac{2\pi}{M}k\ell} \qquad (5)
$$
• From Eq. (5) it is clear that $y_k[m]$ corresponds
to the kth DFT output given the M-point input sequence
$\{v_\ell[m],\ \ell = 0, \dots, M-1\}$. Thus the M filterbank branches can
be computed in parallel by taking an M-point DFT of the M
polyphase outputs, as shown.
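The equivalence claimed by Eq. (5) can be verified on a small example. The sketch below, in Python with illustrative names and an arbitrary filter and input (all assumptions, not taken from the notes), computes the M branch outputs once with the modulated filters $h_k$ and once with the polyphase filters followed by an M-point DFT.

```python
import cmath

def direct_outputs(x, h, M, m):
    """y_k[m] = sum_i h[i] exp(-j 2 pi k i / M) x[mM - i], k = 0..M-1."""
    out = []
    for k in range(M):
        acc = 0j
        for i, hi in enumerate(h):
            if 0 <= m * M - i < len(x):
                acc += hi * cmath.exp(-2j * cmath.pi * k * i / M) * x[m * M - i]
        out.append(acc)
    return out

def polyphase_dft_outputs(x, h, M, m):
    """Same outputs via p_l[r] = h[rM + l] and an M-point DFT, as in Eq. (5)."""
    v = []
    for l in range(M):
        p = h[l::M]                    # l-th polyphase component of h
        acc = 0.0
        for r, pr in enumerate(p):
            idx = (m - r) * M - l      # delayed, downsampled input sample
            if 0 <= idx < len(x):
                acc += pr * x[idx]
        v.append(acc)
    # The k-th branch output is the k-th DFT coefficient of v_0..v_{M-1}.
    return [sum(v[l] * cmath.exp(-2j * cmath.pi * k * l / M) for l in range(M))
            for k in range(M)]

if __name__ == "__main__":
    M = 4
    h = [0.05, 0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]  # assumed filter
    x = [((5 * n) % 13) / 13.0 for n in range(48)]             # arbitrary input
    a = direct_outputs(x, h, M, 6)
    b = polyphase_dft_outputs(x, h, M, 6)
    print(max(abs(p - q) for p, q in zip(a, b)))  # rounding-error level
```

The polyphase outputs `v` are computed once and shared by all M branches, which is where the savings discussed in the next section come from.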
Derived filterbank structure that incorporates the DFT block
1.3 Computational Savings of the Polyphase/DFT
Modulated Filterbank Implementation
Here we consider the analysis bank only; the synthesis bank can be
treated similarly.
Standard structure. Assume that the lowpass filter H(z) has
impulse response length N. To calculate the sub-band output
vector $\{y_k[m],\ k = 0, \dots, M-1\}$ using the standard structure, we
have
1. N multiplications for the filter H(z) plus one multiplication
for the modulator
2. M branches of the filterbank
3. M values of $y_k[m]$ to calculate, one for each k
Thus, the total number of multiplications is $M^2(N + 1)$.
Lowpass/downsampler. If we implement the lowpass/downsampler
in each filterbank branch with a polyphase decimator, the
number of multiplications will be
1. N multiplications for filter $P_i(z)$ for each of M branches,
i.e. N × M
2. M × M multiplications for the M-point DFT
Thus, the total is $NM + M^2 = (M + N)M$.
FFT. If a radix-2 FFT algorithm is used to implement the DFT, we
have approximately
1. $\frac{M}{2}\log_2 M$ multiplications for the radix-2 FFT
2. N multiplications for filter $P_i(z)$ for each of M branches,
i.e. N × M
Thus, the total number of multiplications is $MN + \frac{M}{2}\log_2 M$.
When M = 32 and N = 10, these formulas give 11264 multiplications
for the standard filterbank structure, 1344 for the polyphase/DFT
structure, and only 400 for the polyphase/FFT implementation.
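As a quick arithmetic check, the three multiplication-count formulas of this section can be evaluated directly. This is a minimal Python sketch; the function names are illustrative, and M is assumed to be a power of two so that the radix-2 FFT count applies.

```python
import math

def standard_count(M, N):
    """M^2 (N + 1): standard modulate/filter/downsample structure."""
    return M * M * (N + 1)

def polyphase_dft_count(M, N):
    """N M + M^2: polyphase filters followed by a direct M-point DFT."""
    return N * M + M * M

def polyphase_fft_count(M, N):
    """M N + (M / 2) log2(M): polyphase filters followed by a radix-2 FFT."""
    return M * N + (M // 2) * int(math.log2(M))

if __name__ == "__main__":
    M, N = 32, 10
    print(standard_count(M, N), polyphase_dft_count(M, N), polyphase_fft_count(M, N))
```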
2 The Analysis Subband Filter used by
MPEG-1 Layer-I and II
In the MPEG-1 audio encoder, there are two main processing branches
in the block diagram. One branch is the analyzer of psychoacoustic
effects, and the other is the subband analysis filter bank,
which produces the output of each subband (critical band),
frequency-shifted to the baseband. The detailed processing steps in
the subband analysis filter bank branch are shown in the figure
below. The corresponding code in the MATLAB program
Matlab_MPEG_1_2_4.zip is listed in the following: a few lines
from the main program and all of the subroutine
Analysis_subband_filter.m are shown.
Block diagram of MPEG1 Layer-II
In the flow diagram shown in Fig. 2, the first block shows that a
block of 512 data points is taken into a FIFO (First In First Out)
buffer. The data in the FIFO are processed by a polyphase
filterbank. This FIFO buffer is updated every time the subband
analysis is completed by shifting in a set of 32 new samples, as
illustrated in the figure “Input data for the subband filterbank”.
The second block of Fig. 2 applies the low-pass filter function shown
in the figure “Window function applied to a frame of 512 point data”
to a frame of 512 data points to be sent to the subband
analysis by the polyphase filter bank. This low-pass filter is a
band-limiting filter that suppresses frequency aliasing. The pass band
within a subband (cut-off frequency) is set to $\frac{f_s}{64} \times 0.5824$. This filter
function can be designed by the window method of FIR filter
design, briefly explained in a section to follow. The designed filter
function is then multiplied by the Blackman window (not the
Hanning window). The total length of 512 samples is then divided into
8 segments of 64 samples. The alternating signs
{−,+,−,+,−,+,−,+} are applied to the segments. This shifts
the pass-band to the center of a subband.
To understand the processing details, we will
review the concepts of the polyphase filter bank and the DCT in the
following sections.
Flow Diagram of the MPEG-1 Audio Encoder Layer-I and Layer II
Input data for the subband filterbank
Window function applied to a frame of 512 point data
% Load tables.
[TH, Map, LTq] = Table_absolute_threshold(1, fs, 128); % Threshold in quiet
CB = Table_critical_band_boundaries(1, fs);
C = Table_analysis_window;
% Analysis subband filtering [1, pp. 67].
for i = 0:11,
S = [S; Analysis_subband_filter(x, OFFSET + 32 * i, C)];
end
% -----------------------------------------------
function S = Analysis_subband_filter(Input, n, C)
Common;
nmax = length(Input);
% Check input parameters
if (n + 31 > nmax || n < 1)
    error('Unexpected analysis index.');
end
% Build an input vector X of 512 elements. The most recent sample
% is at position 512 while the oldest element is at position 1.
% Pad with zeroes if the input signal does not exist.
% ...........................................................
% | 480 samples | 32 samples |
% n-480 n n+31
X = Input(max(1, n - 480):n + 31); % / 32768
X = X(:);
X = [zeros(512 - length(X), 1); X];
% Window vector X by vector C. This produces the Z buffer.
Z = X .* C;
% Partial calculation: 64 Yi coefficients
Y = zeros(1, 64);
for i = 1 : 64,
for j = 0 : 7,
Y(i) = Y(i) + Z(i + 64 * j);
end
end
% Calculate the analysis filter bank coefficients
for i = 0 : 31,
for k = 0 : 63,
M(i + 1, k + 1) = cos((2 * i + 1) * (k - 16) * pi / 64);
end
end
% Calculate the 32 subband samples Si
S = zeros(1, 32);
for i = 1 : 32,
for k = 1 : 64,
S(i) = S(i) + M(i, k) * Y(k);
end
end
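The last two stages of the listing (the 64 partial sums and the cosine matrixing) can be restated compactly. The following is a minimal Python sketch of those two stages only, with 0-based indices; the function names are illustrative assumptions, and the window table C from the MATLAB program is not reproduced, so the windowed buffer Z is taken as given.

```python
import math

def partial_sums(Z):
    """Y_k = sum_{j=0}^{7} Z[k + 64 j] for a 512-sample windowed buffer Z."""
    if len(Z) != 512:
        raise ValueError("Z must hold 512 windowed samples")
    return [sum(Z[k + 64 * j] for j in range(8)) for k in range(64)]

def matrixing(Y):
    """S_i = sum_{k=0}^{63} cos((2i + 1)(k - 16) pi / 64) * Y_k, i = 0..31."""
    if len(Y) != 64:
        raise ValueError("Y must hold 64 partial sums")
    return [sum(math.cos((2 * i + 1) * (k - 16) * math.pi / 64) * Y[k]
                for k in range(64))
            for i in range(32)]

if __name__ == "__main__":
    Z = [0.0] * 512
    Z[16] = 1.0                      # single non-zero sample as a sanity check
    S = matrixing(partial_sums(Z))
    print(S[:4])                     # every subband sees cos(0) = 1 for k = 16
```

Only 64 × 32 multiplications are needed in the matrixing step, because the eight 64-sample segments are summed before the cosine modulation is applied.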
3 Application of Psychoacoustic Principles:
ISO 11172-3 (MPEG-1)
PSYCHOACOUSTIC MODEL 1
• It is useful to consider an example of how the psychoacoustic
principles described thus far are applied in actual coding
algorithms. The ISO/IEC 11172-3 (MPEG-1, layer 1)
psychoacoustic model 1 determines the maximum allowable
quantization noise energy in each critical band such that
quantization noise remains inaudible.
• In one of its modes, the model uses a 512-point DFT for
high-resolution spectral analysis (86.13 Hz at a 44.1 kHz sampling
rate), then estimates, for each input frame, individual simultaneous
masking thresholds due to the presence of tone-like and noise-like
maskers in the signal spectrum. A global masking threshold is then
estimated for a subset of the original 256 frequency bins by (power)
additive combination of the tonal and non-tonal individual masking
thresholds.
• This section describes the step-by-step model operations. The
five steps leading to computation of global masking thresholds
are as follows:
1. Spectral Analysis and SPL (Sound Pressure Level)
Normalization
2. Identification of Tonal and Noise Maskers
3. Decimation and Reorganization of Maskers
4. Calculation of Individual Masking Thresholds
5. Calculation of Global Masking Thresholds
3.1 Spectral Analysis and SPL Normalization
First, incoming audio samples s(n), stored as b-bit signed integers,
are normalized according to the FFT length, N, and the number of bits
per sample, b, using the relation
$$
x(n) = \frac{s(n)}{N\, 2^{b-1}}
$$
Normalization references the power spectrum to a 0-dB maximum.
The normalized input, x(n), is then segmented into 12 ms frames
(512 samples) using a 1/16th-overlapped Hann window such that
each frame contains 10.9 ms of new data. A power spectral density
(PSD) estimate, P(k), is then obtained using a 512-point FFT.
Without a window, the DFT is
$$
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi nk}{N}}
$$
and with the window w(n) applied it becomes
$$
X(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j\frac{2\pi nk}{N}}.
$$
The Hanning window (Hann window) defined by
$$
w(n) = \frac{1}{2}\left[ 1 - \cos\left( \frac{2\pi n}{N} \right) \right]
$$
is used to reduce the spectral leakage from other frequencies into
the frequency being analyzed.
Spectrum of the Rectangular (time) Window
Spectrum of the Hanning Window
A power spectral density (PSD) estimate, P(k), is then obtained
from X(k), computed by a 512-point FFT (Fast Fourier
Transform), a fast algorithm for computing the DFT (Discrete Fourier
Transform). The PSD resulting from a 512-point FFT has 256 spectral
components (harmonics).
$$
P(k) = PN + 10 \log_{10} |X(k)|^2, \qquad 0 \le k \le \frac{N}{2}
$$
where the power normalization term, PN, is the reference sound
pressure level of 96 dB.
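The normalization, windowing and PSD formulas of this sub-section can be chained together as follows. This is a minimal Python sketch using a direct DFT for clarity rather than an FFT; the function name `psd_db`, its parameters, and the small floor that avoids taking the logarithm of zero are assumptions made for illustration, and PN is taken as 96 dB as stated above.

```python
import cmath
import math

def psd_db(s, b=16, pn=96.0):
    """PSD estimate P(k) in dB, 0 <= k <= N/2, for one frame s of N samples."""
    N = len(s)
    # Normalization x(n) = s(n) / (N * 2^(b-1))
    x = [v / (N * 2 ** (b - 1)) for v in s]
    # Hann window w(n) = 0.5 * (1 - cos(2 pi n / N))
    w = [0.5 * (1.0 - math.cos(2 * math.pi * n / N)) for n in range(N)]
    P = []
    for k in range(N // 2 + 1):
        X = sum(x[n] * w[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N))
        power = max(abs(X) ** 2, 1e-30)   # assumed floor, avoids log10(0)
        P.append(pn + 10 * math.log10(power))
    return P

if __name__ == "__main__":
    N, k0 = 512, 32
    # Full-scale 16-bit sinusoid centred exactly on bin k0
    s = [2 ** 15 * math.cos(2 * math.pi * k0 * n / N) for n in range(N)]
    P = psd_db(s)
    print(max(range(len(P)), key=P.__getitem__), round(P[k0], 2))
```

For a full-scale sinusoid centred on a bin, the windowed and normalized DFT coefficient has magnitude 1/4, so the peak sits about 12 dB below PN.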
Problem
Matlab_MPEG_1_2_4.zip contains a MATLAB program that simulates
all of the MP3 psychoacoustic masking threshold calculations.
The subroutine FFT_Analysis.m calculates the Power Spectral Density
(PSD). The main program is Test_MPEG.m. Apply this program to
a music piece in *.wav format of your choice to see its PSD. Slide the
time window of 512 samples to find the first block for which no
zero padding is applied to the analysis. The PSD of “Eine Kleine
Nachtmusik” by Mozart is shown below. The key part of the
processing in FFT_Analysis.m is shown below.
% Compute the auditory spectrum using the Fast Fourier Transform.
% The spectrum X is expressed in dB. The size of the transform is 512 and
% is centered on the 384 samples (12 samples per subband) used for the
% subband analysis. The first of the 384 samples is indexed by n:
% ................................................
% | | 384 samples | |
% n-64 n n+383 n+447
% A Hanning window is applied before computing the FFT.
%
% Prepare the Hanning window
h = sqrt(8/3) * hanning(FFT_SIZE);
% Power density spectrum
X = max(20 * log10(abs(fft(s .* h)) / FFT_SIZE), MIN_POWER);
% Normalization to the reference sound pressure level of 96 dB
Delta = 96 - max(X);
X = X + Delta;
PSD of “Eine Kleine Nachtmusik” by Mozart
3.2 Identification of Tonal and Noise Maskers
After PSD estimation and SPL normalization, tonal and non-tonal
masking components are identified.
Tonal maskers
Local maxima in the sample PSD which exceed neighboring
components within a certain Bark distance by at least 7 dB are
classified as tonal. Specifically, the tonal set, $S_T$, is defined as
$$
S_T = \left\{ P(k) \;\middle|\; P(k) > P(k \pm 1),\ \ P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \right\}
$$
where
$$
\Delta_k \in
\begin{cases}
2 & 2 < k < 63 \quad (0.17\text{-}5.5\ \text{kHz}) \\
\{2, 3\} & 63 \le k < 127 \quad (5.5\text{-}11\ \text{kHz}) \\
\{2, \dots, 6\} & 127 \le k < 256 \quad (11\text{-}20\ \text{kHz})
\end{cases}
$$
Tonal maskers, $P_{TM}(k)$, are computed from the spectral peaks
listed in $S_T$ as follows:
$$
P_{TM}(k) = 10 \log_{10} \sum_{j=-1}^{+1} 10^{0.1 P(k+j)} \ \text{dB}
$$
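The tonal-masker test can be written out as a short routine. The following is a minimal Python sketch, assuming the PSD is given as a list `P` in dB indexed by bin number; the function names are illustrative and this is not the code of Find_tonal_components.m.

```python
import math

def delta_k(k):
    """Neighbourhood offsets for bin k, following the ranges given above."""
    if 2 < k < 63:
        return [2]
    if 63 <= k < 127:
        return [2, 3]
    if 127 <= k < 256:
        return [2, 3, 4, 5, 6]
    return []

def tonal_maskers(P):
    """Return {k: P_TM(k)} for a PSD list P (dB) indexed by bin 0..N/2."""
    maskers = {}
    for k in range(3, len(P) - 1):
        offsets = delta_k(k)
        if not offsets or k + max(offsets) >= len(P):
            continue
        # Local maximum: P(k) > P(k - 1) and P(k) > P(k + 1)
        if not (P[k] > P[k - 1] and P[k] > P[k + 1]):
            continue
        # At least 7 dB above every neighbour at distance Delta_k
        if all(P[k] > P[k + d] + 7 and P[k] > P[k - d] + 7 for d in offsets):
            # Combine the peak with its two adjacent lines in the power domain
            maskers[k] = 10 * math.log10(
                sum(10 ** (0.1 * P[k + j]) for j in (-1, 0, 1)))
    return maskers

if __name__ == "__main__":
    P = [0.0] * 257
    P[10] = 40.0     # strong, isolated peak: tonal
    P[100] = 5.0     # local maximum, but less than 7 dB above its neighbours
    print(tonal_maskers(P))
```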
Noise maskers
A single noise masker for each critical band, $P_{NM}(\bar{k})$, is then
computed from the (remaining) spectral lines not within the $\pm\Delta_k$
neighborhood of a tonal masker using the sum
$$
P_{NM}(\bar{k}) = 10 \log_{10} \sum_{j} 10^{0.1 P(j)} \ \text{dB}
\qquad \text{for all } P(j) \notin \{ P_{TM}(k,\ k \pm 1,\ k \pm \Delta_k) \}
$$
where
$$
\bar{k} = \left( \prod_{j=l}^{u} j \right)^{\frac{1}{u-l+1}}
$$
is the geometric mean spectral line of the critical band, and l and u
are the lower and upper spectral line boundaries of the critical band,
respectively.
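A small numerical illustration of the noise-masker sum and of the geometric mean line follows. It is a Python sketch under stated assumptions: the band boundaries l and u (with l ≥ 1) and the set of excluded lines are supplied by the caller, and the function names are hypothetical.

```python
import math

def geometric_mean_bin(l, u):
    """k_bar = (prod_{j=l}^{u} j) ** (1 / (u - l + 1)), computed via logs."""
    log_sum = sum(math.log(j) for j in range(l, u + 1))
    return math.exp(log_sum / (u - l + 1))

def noise_masker(P, l, u, excluded):
    """Power sum in dB of lines l..u of the PSD P that are not in `excluded`."""
    lines = [j for j in range(l, u + 1) if j not in excluded]
    if not lines:
        return None   # every line in the band belongs to a tonal masker
    return 10 * math.log10(sum(10 ** (0.1 * P[j]) for j in lines))

if __name__ == "__main__":
    P = [0.0] * 10                     # flat 0 dB spectrum
    print(geometric_mean_bin(4, 7))    # fourth root of 4*5*6*7
    print(noise_masker(P, 4, 7, {5}))  # three lines of 0 dB each
```

Three remaining lines at 0 dB each add up to 10 log10(3), about 4.8 dB, which shows that the combination is done in the power domain and not in dB.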
(1) local maxima
(2) tonal components
(3) tonal and non-tonal components of Eine Kleine Nachtmusik
Problem
The subroutine Find_tonal_components.m contained in the
MP3 psychoacoustic masking simulation program
Matlab_MPEG_1_2_4.zip first calculates the local maxima of the
Power Spectral Density (PSD). From the obtained local maxima
of the PSD, tonal components are calculated based on the equations
described above. Then, non-tonal components and the
frequencies of the critical bands are calculated. The main program
is Test_MPEG.m. Apply this program to the music piece in
*.wav format chosen in the previous Problem to show the 3 figures
generated by Find_tonal_components.m: (1) local maxima, (2)
tonal components, and (3) tonal and non-tonal components.
3.3 Decimation and Reorganization of Maskers
In this step, the number of maskers is reduced using two criteria.
First, any tonal or noise maskers below the absolute threshold are
discarded, i.e., only maskers which satisfy
$$
P_{TM,NM}(k) \ge T_q(k)
$$
are retained, where $T_q(k)$ is the SPL of the threshold in quiet at
spectral line k. Next, a sliding 0.5 Bark-wide window is used to
replace any pair of maskers occurring within a distance of 0.5 Bark
by the stronger of the two.
After the sliding window procedure, masker frequency bins are
reorganized according to the subsampling scheme
$$
P_{TM,NM}(i) =
\begin{cases}
P_{TM,NM}(k) & \text{if } i = k \\
0 & \text{if } i \ne k
\end{cases}
$$
The net effect is 2:1 decimation of masker bins in critical bands
18-22 and 4:1 decimation of masker bins in critical bands 22-25,
with no loss of masking components. This procedure reduces the
total number of tone and noise masker frequency bins under
consideration from 256 to 106. An example of decimation for
equal SPL is shown in the table below.
bin k   bin i   action
50      50      keep
51      52      zero
52      52      keep
100     100     keep
101     104     zero
102     104     zero
103     104     zero
104     104     keep
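The table is consistent with a simple index mapping from the original bin k to the retained bin i. The piecewise rule in the Python sketch below is an assumption: it is not stated in these slides, and it is written from the form used in the Painter and Spanias review cited at the beginning, so it should be checked against that paper or the standard. It does reproduce the rows of the table and the reduction to 106 bins.

```python
def decimated_bin(k):
    """Bin i that receives the masker found at bin k (assumed mapping)."""
    if 1 <= k <= 48:
        return k                        # no decimation
    if 49 <= k <= 96:
        return k + (k % 2)              # 2:1 decimation, even bins kept
    if 97 <= k <= 232:
        return k + 3 - ((k - 1) % 4)    # 4:1 decimation, multiples of 4 kept
    raise ValueError("bin outside the range covered by the model")

if __name__ == "__main__":
    for k in (50, 51, 52, 100, 101, 102, 103, 104):
        i = decimated_bin(k)
        print(k, i, "keep" if i == k else "zero")
    print(len({decimated_bin(k) for k in range(1, 233)}))  # retained bins
```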
Problem
The subroutine Decimation.m contained in the MP3 psychoacoustic
masking simulation program Matlab_MPEG_1_2_4.zip performs all
of the decimation steps described in this sub-section. Apply
this program to the music piece in *.wav format chosen in the previous
Problem to see whether any SPLs are eliminated because (1) a tonal
or noise masker is below the absolute threshold, (2) a pair of
maskers occurring within a distance of 0.5 Bark is replaced by the
stronger of the two, or (3) masker bins are decimated 2:1 in critical
bands 18-22 and 4:1 in critical bands 22-25.
Tonal and non-tonal maskers after decimation. Only one non-tonal
masker SPL under the absolute threshold was eliminated.
3.4 Calculation of Individual Masking Thresholds
Having obtained a decimated set of tonal and noise maskers,
individual tone and noise masking thresholds are computed next.
Each individual threshold represents a masking contribution at
frequency bin i due to the tone or noise masker located at bin j
(reorganized during step 3). Tonal masker thresholds, $T_{TM}(i, j)$,
are given by
$$
T_{TM}(i, j) = P_{TM}(j) - 0.275\, z(j) + SF(i, j) - 6.025 \ \text{dB}
$$
where $P_{TM}(j)$ denotes the SPL of the tonal masker in frequency
bin j, z(j) denotes the Bark frequency of bin j,
and the spread of masking from masker bin j to maskee bin i,
SF(i, j), is modeled by the expression
$$
SF(i, j) =
\begin{cases}
17\Delta_z - 0.4 P_{TM}(j) + 11 & -3 \le \Delta_z < -1 \\
(0.4 P_{TM}(j) + 6)\,\Delta_z & -1 \le \Delta_z < 0 \\
-17\Delta_z & 0 \le \Delta_z < 1 \\
(0.15 P_{TM}(j) - 17)\,\Delta_z - 0.15 P_{TM}(j) & 1 \le \Delta_z < 8
\end{cases}
\ \text{dB}
$$
Prototype spreading functions at z=10 as a function of masker level
SF(i, j) is a piecewise linear function of masker level, $P_{TM}(j)$, and
Bark maskee-masker separation, $\Delta_z = z(i) - z(j)$. SF(i, j)
approximates the basilar spreading (excitation pattern). As
shown in the figure, the slope of $T_{TM}(i, j)$ decreases with
increasing masker level. This is a reflection of psychophysical test
results, which have demonstrated that the ear’s frequency
selectivity decreases as stimulus levels increase. It is also noted
here that the spread of masking in this particular model is
constrained to a 10-Bark neighborhood for computational
efficiency. This simplifying assumption is reasonable given the very
low masking levels which occur in the tails of the basilar excitation
patterns modeled by SF(i, j).
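The piecewise spreading function and the tonal threshold formula translate directly into code. The following is a minimal Python sketch with illustrative function names; the Bark values z(i) and z(j) are assumed to be supplied by the caller, since the Hz-to-Bark conversion is not given in these notes, and separations outside the modelled range return `None`.

```python
def spreading_function(dz, p):
    """SF in dB for Bark separation dz = z(i) - z(j) and masker level p (dB)."""
    if -3 <= dz < -1:
        return 17 * dz - 0.4 * p + 11
    if -1 <= dz < 0:
        return (0.4 * p + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz < 8:
        return (0.15 * p - 17) * dz - 0.15 * p
    return None   # outside the modelled neighbourhood

def tonal_threshold(p_tm, z_i, z_j):
    """T_TM(i, j) = P_TM(j) - 0.275 z(j) + SF(i, j) - 6.025 dB."""
    sf = spreading_function(z_i - z_j, p_tm)
    if sf is None:
        return None
    return p_tm - 0.275 * z_j + sf - 6.025

if __name__ == "__main__":
    # Upper slope (dB per Bark) for a quiet and a loud masker
    for level in (40.0, 80.0):
        slope = spreading_function(3, level) - spreading_function(2, level)
        print(level, slope)
    print(tonal_threshold(60.0, 10.0, 10.0))
```

The printed slopes illustrate the statement above: for separations above 1 Bark the threshold falls off by 0.15 P − 17 dB per Bark, which is less steep for louder maskers.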
Individual noise masker thresholds, $T_{NM}(i, j)$, are given by
$$
T_{NM}(i, j) = P_{NM}(j) - 0.175\, z(j) + SF(i, j) - 2.025 \ \text{dB}
$$
where $P_{NM}(j)$ denotes the SPL of the noise masker in frequency
bin j, z(j) denotes the Bark frequency of bin j, and SF(i, j) is
obtained by replacing $P_{TM}(j)$ with $P_{NM}(j)$.
Problem
The subroutine Individual_masking_thresholds.m contained in
the MP3 psychoacoustic masking simulation program
Matlab_MPEG_1_2_4.zip calculates the individual masking thresholds of
tonal maskers, $T_{TM}(i, j)$, and non-tonal maskers, $T_{NM}(i, j)$, using
the spreading function SF(i, j). Apply this program to the music
piece in *.wav format chosen in the previous Problem to plot the
individual masking thresholds of a frame.
3.5 Calculation of Global Masking Thresholds
In this step, individual masking thresholds are combined to
estimate a global masking threshold for each frequency bin in the
subset given by Eq. 3.4. The model assumes that masking effects
are additive. The global masking threshold, $T_g(i)$, is therefore
obtained by computing the sum
$$
T_g(i) = 10 \log_{10} \left( 10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i,m)} \right) \ \text{dB}
$$
where $T_q(i)$ is the absolute hearing threshold for frequency bin i,
$T_{TM}(i, l)$ and $T_{NM}(i, m)$ are the individual masking thresholds,
and L and M are the number of tonal and noise maskers,
respectively, identified previously.
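The power-additive combination for a single frequency bin can be sketched as follows. This is a minimal Python example with a hypothetical function name; the threshold in quiet and the lists of individual thresholds (all in dB) are assumed to have been computed in the earlier steps.

```python
import math

def global_threshold(tq, t_tm, t_nm):
    """T_g(i) in dB from T_q(i) and the individual thresholds at bin i."""
    power = 10 ** (0.1 * tq)                        # threshold in quiet
    power += sum(10 ** (0.1 * t) for t in t_tm)     # tonal maskers, l = 1..L
    power += sum(10 ** (0.1 * t) for t in t_nm)     # noise maskers, m = 1..M
    return 10 * math.log10(power)

if __name__ == "__main__":
    # One tonal masker at the same level as the threshold in quiet
    print(global_threshold(0.0, [0.0], []))
    # A dominant tonal masker with two much weaker contributions
    print(global_threshold(-5.0, [30.0], [2.0]))
```

Because the sum is taken over powers, two equal contributions raise the threshold by about 3 dB, while a contribution far below the strongest one changes the result very little.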
In other words, the global threshold for each frequency bin
represents a signal dependent, power additive modification of the
absolute threshold due to the basilar spread of all tonal and noise
maskers in the signal power spectrum. The next figure shows the global
masking threshold obtained by adding the power of the individual
tonal and noise maskers to the absolute threshold in quiet.
Individual masking thresholds for both tonal and non-tonal
maskers. The global masking threshold is the sum of all individual
masking thresholds.
4 End