CHANGING THE MFCC INTENSITY FUNCTION
Abstract of
a thesis presented to the Faculty
of the University at Albany, State University of New York
in partial fulfillment of the requirements
for the degree of
Master of Arts
College of Arts & Sciences
Department of Mathematics & Statistics
Nicholas W. Jarrett
2010
ABSTRACT
Mel-Frequency Cepstral Coefficients (MFCCs) are commonly used for a variety of problems in signal processing. This approach can be limited, however, when one attempts to model something which cannot be easily discriminated with the human ear. The MFCC intensity function is 1125 · ln(1 + f/700), where f denotes frequency. This thesis investigates the results that can be obtained with a more objective intensity function to generate coefficients. These coefficients will be referred to as NMFCCs.
The new intensity function was obtained by using regression techniques to approximate the sorted frequencies of a collection of signals. The NMFCCs were used in lieu of the MFCCs in an existing algorithm which was designed to discriminate between the A, L, Q, and P keys using the sounds produced by striking them. The results obtained through the use of MFCCs in this algorithm were used as a baseline to assess the performance of the NMFCCs. With the substitution of NMFCCs for MFCCs into the algorithm, the accuracy increased by approximately 30.51% relative to the baseline, from 65.56% to 85.56%.
Each problem has a different family of characteristic frequencies. The imposed
intensity function can either obfuscate or illuminate them. While these results do not
prove that objective intensity functions are superior to the traditional MFCCs, they
provide an example where the NMFCCs reveal more important characteristics than
the MFCCs.
CHANGING THE MFCC INTENSITY FUNCTION
A thesis presented to the Faculty
of the University at Albany, State University of New York
in partial fulfillment of the requirements
for the degree of
Master of Arts
College of Arts & Sciences
Department of Mathematics & Statistics
Nicholas W. Jarrett
2010
ACKNOWLEDGEMENTS
“The world we have created is a product of our thinking;
it cannot be changed without changing our thinking.”
- Albert Einstein
Thank you to all of the inspirational people who have opened my understanding
of the world.
My thesis advisor, Dr. Carlos Rodriguez, truly redefined my perception of statistics. It is now my life’s passion, and his enthusiasm for Bayesian statistics has truly inspired me. The best thing that has happened to me here at the University at Albany was being lucky enough to be in his classroom.
Dr. Karin Reinhold is my graduate advisor. Professor Reinhold has always been someone I could go to with my troubles, from choosing the classes that would most benefit my subject area to just needing someone to talk to. She is truly an amazing individual.
I have given several presentations for Dr. Martin Hildebrand’s master’s seminar. His advice has helped me get to where I am now, particularly in my abilities as an instructor. He taught me something invaluable which I know no textbook, only his life experience, could have taught me.
I would especially like to thank my good friend, Matthew Lester. We spent
countless nights working on problems together with little to no sleep. A lot of this
thesis would be radically different without his input.
TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
2 ORIGIN OF THE MFCCS
3 SHORT-TIME FOURIER TRANSFORMS
4 CONSTRUCTION OF THE MFCCS
5 AUDITORY FEATURE EXTRACTION ON A COMPUTER
6 MFCCS IN ACTION
7 THE FOUR KEY CLASSIFICATION PROBLEM
8 STRUCTURE IMPOSED BY MFCCS
9 CONSTRUCTION OF THE NMFCCS
10 PERFORMANCE & CONCLUSION
11 TUNER MFCC CODE

REFERENCES
1 Introduction

Mel Frequency Cepstral Coefficients (MFCCs) are used in many different areas of signal processing. They are similar in principle to the filtering performed by the human ear, and are especially good for speech recognition. MFCCs are defined in part by the intensity function 1125 · ln(1 + f/700), where f denotes frequency. This thesis investigates the results that can be obtained by using a more objective intensity function to generate coefficients. These coefficients will be referred to as NMFCCs.
The new intensity function was obtained by using regression techniques to approximate the sorted frequencies of a collection of signals. The NMFCCs were used in lieu of the MFCCs in an existing algorithm which was designed to discriminate between the A, L, Q, and P keyboard keys using the sounds produced by striking them. The results obtained through the use of MFCCs in this algorithm were used as a baseline to assess the performance of the NMFCCs. With the substitution of NMFCCs for MFCCs into the algorithm, the accuracy increased by approximately 30.51% relative to the baseline.
All programming for this thesis was done in the R statistical environment. The R packages e1071, tuneR, and DAAG were used throughout this analysis.

This thesis was typeset with LaTeX. Utilized packages and style files include graphicx, geometry, setspace, and amssymb.
2 Origin of the MFCCs

The data sets associated with sound files are very large. In order to analyze the data, it must first be cut down to a reasonable size. The MFCCs are often used for speech recognition, and essentially mimic the functionality of the human ear. The physiology of the human ear is used to motivate a solution to the data size problem. The concepts at the heart of MFCCs will be presented in this section, along with their relevance to the human ear.
One of the challenges in audio and visual analysis is simply to find a method to deal with the sheer size of the data set. Humans have been very successful in performing discrimination for these kinds of problems, both in accuracy and speed. Consequently, many techniques directed at understanding such phenomena have been developed to mimic human physiology.
At first, it may not be clear that characteristic feature extraction from the raw data is necessary. Humans do not record a list of amplitudes when we hear a sound, so we are not aware of the volume of data encoded in sound. For standard sampling rates, the text file of raw amplitudes associated with 30 seconds of sound can be over a gigabyte in size. Not only does our ear filter the sound significantly, but our brain passes the information through filters as well. When standard speech is passed through even simple audio filters, it can become unrecognizable. However, when the listener is told beforehand what is being said, they often become able to hear it. This is an example of how the brain’s expectations place filters on auditory data. More than this, though, the information arriving at the brain has already been cut down by the ear. This is supported by anatomical and physiological reasoning as well as experimental evidence relating to the perceptibility of slight differences in sound frequencies.
The human ear generally consists of three sections: outer, middle, and inner. See Figure 1. For the purpose at hand, the inner ear is the most important. Sound waves are collected in the outer ear and channeled in through the middle ear. They enter the inner ear via the oval window. The cochlea, located in the inner ear, is essentially the auditory feature extractor for the human body. Figure 2 gives a frequency-location breakdown along the basilar membrane. The amplitude of a wave increases as it moves along the tube until it hits its maximum, at which point it decays rapidly.
Figure 1: The Human Ear [6]
Figure 2: Basilar Membrane [10]
We are able to perceive sounds by the use of nerves located throughout the basilar membrane, which sense the point at which rapid decay begins and the intensity at that point. The idea is essentially that there are holes in our ability to determine precise location. Just as a mosquito can land on your arm without you feeling it, we may be unable to determine the exact location at which a wave begins to decay along the basilar membrane. As a result, our ability to distinguish between sounds in these regions may be diminished or even non-existent. This phenomenon is characterized by considering critical bands, centered at critical frequencies, over which we are unable to tell one frequency from another.
3 Short-Time Fourier Transforms (STFTs)
The MFCCs essentially transform and regroup the frequencies that make up
a sound wave; however, most waves are stored by listing their amplitudes, not by
their frequency decomposition. Therefore, as a prerequisite to the extraction of the
MFCCs, the raw data must be transformed into the frequency domain.
Short-Time Fourier Transforms (STFTs) are used to perform this transformation. Some of the theory and math behind STFTs will be presented in this section. Understanding the general form of the STFT will be helpful in understanding the process by which MFCCs are computed.
Let x[n] be a time series.

Let w[n] represent the window over which analysis is to be performed.

Assumption: w[n] = 0 outside of [0, N_w − 1], i.e., on (−∞, 0) ∪ (N_w − 1, ∞).

Then, the STFT at time n can be expressed by:

$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\, w[n-m]\, e^{-j\omega m}$$
All of the transforms utilized in this thesis are discrete. The reason for this is discussed later, in the section “Auditory Feature Extraction on a Computer.” The discrete transform is given by sampling X(n, ω) over the unit circle:

$$X(n, k) = X(n, \omega)\Big|_{\omega = \frac{2\pi}{N}k} = \sum_{m=-\infty}^{\infty} x[m]\, w[n-m]\, e^{-j\frac{2\pi}{N}km}$$

In the above equation, 2π/N represents the frequency sampling interval.
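As an illustration of the discrete transform, the STFT can be computed in R by sliding a window along the signal and taking the FFT of each windowed segment. The sketch below makes simple assumptions (a Hamming window, window positions kept inside the signal); the name stft_sketch is illustrative and is not the tuneR implementation.

# Sketch: discrete STFT of a signal x with window length Nw and hop size inc.
# Each row of the result holds the FFT coefficients of one windowed segment,
# i.e. X(n, k) for a fixed window position n and k = 0, ..., Nw - 1.
stft_sketch <- function(x, Nw, inc) {
  w <- 0.54 - 0.46 * cos(2 * pi * (0:(Nw - 1)) / (Nw - 1))  # Hamming window
  starts <- seq(1, length(x) - Nw + 1, by = inc)            # window positions
  t(sapply(starts, function(s) fft(x[s:(s + Nw - 1)] * w)))
}

# Example: a 440 Hz sine sampled at 8 kHz, 200-sample windows, 50% overlap
x <- sin(2 * pi * 440 * (0:7999) / 8000)
X <- stft_sketch(x, Nw = 200, inc = 100)
dim(X)  # (number of windows) x (number of frequency bins)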
4 Construction of the MFCCs

Two methods by which MFCCs can be extracted from the signal will be presented in this section. The first is from the tuneR package for R, as well as [6]. A slightly more mathematically rigorous approach, from [11], will be presented after the traditional approach is given.

The general idea of MFCCs is very similar to the method of human auditory feature extraction.
The intensity function for the Mel scale is:

$$M(f) = 1125 \cdot \ln\left(1 + \frac{f}{700}\right)$$
Figure 3: Mel Intensity & Critical Bands [6]
Sampling uniformly from the Mel scale and mapping down to the frequency axis yields the critical bands and frequencies. MFCCs attempt to mimic the filtering process of the ear by combining the energy within critical bands. Naturally, this filtering process works well for the purpose of speech processing; however, MFCCs are sometimes applied in more abstract problems.
More rigorously, in the traditional approach to extracting MFCCs from a sound file, the cochlea of the human ear is incorporated into the model by considering a collection of bandpass filters. These filters are designed to ignore all frequencies outside of the critical band which they represent. As stated earlier, these critical bands are centered at critical frequencies. These frequencies are typically chosen such that the perceptual difference between each successive pair of critical frequencies is some constant amount.
1. Map the frequencies onto the intensity function M(f).

2. Let c be a constant.

3. The set of critical frequencies $f_j$ must then satisfy $M(f_{j+1}) - M(f_j) = c$ for all $j$.
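For example, the critical frequencies can be generated in R by sampling the Mel scale uniformly and mapping back through the inverse of M(f). This is only a sketch; the band count and the 8000 Hz upper limit are arbitrary choices for illustration.

# Mel intensity function and its inverse
mel  <- function(f) 1125 * log(1 + f / 700)
imel <- function(m) 700 * (exp(m / 1125) - 1)

# Critical frequencies f_j with M(f_{j+1}) - M(f_j) = c for all j,
# covering 0 Hz up to 8000 Hz with 24 bands
n.bands <- 24
m.grid  <- seq(mel(0), mel(8000), length.out = n.bands + 2)
f.crit  <- imel(m.grid)       # equally spaced on the Mel scale, not in Hz
diff(mel(f.crit))             # constant spacing c on the Mel scale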
The section “tuneR MFCC Code” is intended to supplement the following paragraphs, which explain the MFCC extraction process as it is executed in traditional methods. It can be skipped, especially by readers who are not familiar with R.
As stated previously, the MFCC extraction process works with the frequency decomposition of a sound file. STFTs were used to obtain this decomposition; they are FTs over incremental windows which cover the whole signal.

1. The frequency decomposition of the signal over each of these windows is stored in [STFTs].

2. The intensity function, desired number of coefficients, and frequency overlapping are used to define the [Scaling] matrix, which takes the name T.windows in “tuneR MFCC Code.”
3. [STFTs] and [Scaling] are composed by matrix multiplication to combine the energy over respective critical bands.

4. The log of the absolute value of this result is negated. This prepares it for composition with the Inverse Discrete Cosine Transform (IDCT):

−log |[STFTs] × [Scaling]|

5. Finally, matrix multiplication on the right by [IDCT] yields the MFCCs:

−log |[STFTs] × [Scaling]| × [IDCT]
Since the IDCT can be interpreted as an inverse FT, this step essentially undoes the transformation of the signal into frequencies using FTs from the beginning of the process.
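In R notation, steps 3 through 5 amount to two matrix products and an elementwise log. A minimal sketch, assuming the matrices [STFTs], [Scaling], and [IDCT] have already been built with compatible dimensions:

# stfts:   (windows x frequency bins)  frequency decomposition, step 1
# scaling: (frequency bins x bands)    critical-band filter matrix, step 2
# idct:    (bands x coefficients)      inverse discrete cosine transform
# Steps 3-5: combine band energies, take -log|.|, then apply the IDCT
mfcc_from_matrices <- function(stfts, scaling, idct) {
  -log(abs(stfts %*% scaling)) %*% idct
}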
Figure 4: Mel Decomposition of a Signal [6]
Alternatively, in [11] the MFCCs are computed directly on the power spectrum. As with the traditional method, STFTs must be used first; the derivation below then works with the cepstral coefficients $c_k$ computed from the power spectrum of each window.
1. Transform the signal into its frequency decomposition:

$$c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} d\omega\, \log\left|X(e^{j\omega})\right| \cdot e^{j\omega k} \qquad (1)$$

where |X(·)| represents the power spectrum.
2. The sequential application of a monotone invertible frequency warping function g: [−π, π] → [−π, π] and the Discrete Cosine Transform (DCT) can be expressed as follows:

$$\omega \to \tilde{\omega} = g(\omega)$$

$$c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} d\tilde{\omega}\, \log\left|X\!\left(e^{j g^{-1}(\tilde{\omega})}\right)\right| \cdot e^{j\tilde{\omega}k}$$
3. By changing the variable of integration and using the derivative of the warping function, $d\tilde{\omega}/d\omega$, the warping is directly incorporated into the cosine transformation:

$$c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} d\omega\, \log\left|X(e^{j\omega})\right| \cdot e^{j g(\omega) k} \cdot g'(\omega) \qquad (2)$$
4. This is then approximated by a discrete sum:

$$c_k \cong \frac{1}{N}\sum_{n=0}^{\frac{N-2}{2}} \left\{ \log\left|X\!\left(e^{j\frac{2\pi n}{N}}\right)\right| \cdot \cos\!\left(g\!\left(\frac{2\pi n}{N}\right)k\right) \cdot g'\!\left(\frac{2\pi n}{N}\right) \right\} \qquad (3)$$
5. The specific frequency warping we are interested in for MFCCs is given by the intensity function:

$$\mu(\omega) = 2595 \cdot \log\left(1 + \frac{\omega f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right) \qquad (4)$$

where $f_s$ denotes the sampling frequency.
6. The Mel warping function must meet the criterion $\tilde{\mu}(\pi) = \pi$ for integration into the cosine transformation:

$$\tilde{\mu}(\omega) = \frac{\pi}{\mu(\pi)} \cdot \mu(\omega) = d \cdot \log\left(1 + \frac{\omega f_s}{2\pi \cdot 700\,\mathrm{Hz}}\right) \qquad (5)$$

where

$$d = \frac{\pi}{\log\left(1 + \frac{f_s}{2 \cdot 700\,\mathrm{Hz}}\right)} \qquad (6)$$
7. By replacing g(·) with $\tilde{\mu}$ in (3), the implementation of the MFCC computation can be completed.
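A direct translation of (3) into R, with g replaced by the normalized warping $\tilde{\mu}$ of (5) and (6), might look as follows. This is a sketch, not the implementation used in this thesis: pspec is assumed to hold the power spectrum samples $|X(e^{j 2\pi n/N})|$ for n = 0, ..., (N − 2)/2, and the function name is illustrative.

# Sketch of equation (3) with the normalized Mel warping of (5) and (6).
# pspec: power spectrum samples for n = 0, ..., (N-2)/2
# fs:    sampling frequency; K: number of coefficients to return
warped_cepstrum <- function(pspec, fs, K) {
  N <- 2 * length(pspec)                 # pspec covers n = 0, ..., N/2 - 1
  w <- 2 * pi * (seq_along(pspec) - 1) / N
  d <- pi / log(1 + fs / (2 * 700))      # normalizing constant, eq. (6)
  mu.t  <- d * log(1 + w * fs / (2 * pi * 700))                    # eq. (5)
  dmu.t <- d * (fs / (2 * pi * 700)) / (1 + w * fs / (2 * pi * 700))
  sapply(1:K, function(k) sum(log(pspec) * cos(mu.t * k) * dmu.t) / N)
}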
5 Auditory Feature Extraction on a Computer

Much like the human ear, computers acquire sound information from their environment by evaluating changes in air pressure induced by a sound wave. The mechanical analog to the ear’s oval window is the microphone’s elastic membrane, which fluctuates back and forth as a sound wave compresses it. By measuring the amount of compression, a computer is able to obtain a list of amplitudes at discrete times which represent the sound. The frequency of the compression is identical to the frequency of the corresponding sound, and the square of the pressure is proportional to the sound’s intensity. Thus a sound wave can be completely characterized simply by measuring fluctuations in the pressure induced along the membrane.
It is important to note that sound waves are continuous; however, we are only able to measure them discretely. This can potentially lead to problems. As stated previously, the FT of the data is typically interpreted as yielding its frequency components, but the discrete manner in which we sample the data can actually obscure the true frequencies. The signal which is reconstructed from these samples need not be identical to the original wave. This behavior is referred to as aliasing. Aliasing also happens in visual media; a typical example is the apparent counterclockwise motion of cart wheels in old cowboy movies. For an illustration of this effect for waves, see Figure 5.
Figure 5: Aliasing [6]
In fact, there are an infinite number of possible sine curves that can generate the same regularly sampled data points. There are several ways in which aliasing can be reduced. Clearly, a larger sampling rate reduces aliasing, since the reconstructed function has to agree with the true function at the sampled points; in particular, a wave can only be aliased if its frequency exceeds half the sampling rate (the Nyquist frequency). Additionally, the algorithm by which the frequencies are reconstructed can influence how profound the effects of aliasing are. This issue was not investigated in great detail for this thesis; however, keyboard strokes are hardly ordinary signals. It would be interesting to see if improving the characteristic frequencies we attribute to each signal by addressing the issue of aliasing could reveal more unique features for different keys.
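The effect is easy to reproduce in R: a 900 Hz cosine sampled at 1000 Hz produces exactly the same samples as a 100 Hz cosine, because 900 Hz lies above the Nyquist rate of 500 Hz. The frequencies below are chosen purely for illustration.

# Aliasing sketch: the two sampled cosines are indistinguishable
fs <- 1000
n  <- 0:49
x.high <- cos(2 * pi * 900 * n / fs)   # above the Nyquist frequency
x.low  <- cos(2 * pi * 100 * n / fs)   # its low-frequency alias
max(abs(x.high - x.low))               # 0 up to floating-point error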
6 MFCCs in Action

MFCCs have been widely applied to problems ranging from speech recognition to the analysis of mechanically produced signals. They are often used to address the problem of music genre classification, and were shown to be highly successful for this problem in [4]. MFCCs have been experimentally shown to work well for instrument recognition, artist recognition, and music information retrieval [5]. MFCCs are also used for problems with more abstract sounds. They were suggested as the best method of characteristic feature extraction for discrimination of computer keyboard emanations by [1] and [2]. They were also used in [3] to discriminate keyboard key strokes with SVMs. For the A, L, Q, and P keys, [3] was able to attain an accuracy of 65.56% using MFCCs. The implications of using MFCCs are explored in reference to this type of problem, and NMFCCs are developed in an attempt to extract features which are more useful for discrimination. The results from [3] will be used in the remainder of this thesis as a baseline against which the NMFCCs will be compared.
Results must be interpreted in the context of the algorithm used to obtain them. The following section describes the approach used in [3] for the four key classification problem. It is not necessary for the theory of MFCCs and NMFCCs, and can be temporarily skipped if the reader wishes to continue developing them from a theoretical standpoint.
7 The Four Key Classification Problem

The training data set consists of 50 strokes of each of the A, L, Q, and P keys. The key stroke signals are first isolated in the sound file, and then two sets of features are extracted from them: autocorrelation coefficients and MFCCs. For each key, 65 autocorrelation coefficients are taken, with lags ranging from 2 to 66. This is augmented by a number of MFCCs determined by seven tuning parameters. This matrix of features will be referred to as Xtraining. Furthermore, the true labels for the training set will be denoted Ytraining. The SVM then learns the classification rule for {Xtraining, Ytraining}. It finds this rule subject to the following constraints:
1. SVM-Type: C-classification
2. SVM-Kernel: radial
The discrimination rule is used on the features of the test data, Xtest, to generate predicted labels Ypredicted. The accuracy of the discrimination rule is then evaluated by comparing the true and predicted labels, Ytest and Ypredicted. More explicitly, Ytest is the following 90 × 1 vector of labels:
> ytest =c("L", "A", "A", "L", "A", "L", "A", "A", "A", "L",
+ "L", "L", "L", "L", "L", "A", "A", "A", "A",
+ "A", "L", "A", "A", "L", "A", "A", "A", "L",
+ "A", "L", "L", "A", "A", "A", "A", "L", "A",
+ "L", "L", "A", "L", "A", "L", "A", "L","P","Q","Q","Q",
+ "Q","Q","Q","Q","P","P","Q","Q","Q",
+ "Q","Q","Q","P","P","P","P","P","P","Q","Q","Q",
+ "Q","Q","Q","P","Q","Q","Q","Q","P","Q","P","Q",
+ "P","Q","P","P","P","P","P","Q")
The discrimination was done with three binary decisions, with P assigned by elimination:

1. A against not A.

2. L against not L, within what was not classified as an A in the first step.

3. Q against not Q, within what was not classified as an L in the second step.

4. P is assigned to whatever is not classified as a Q in the third step.

Each binary classification step behaves as described above. More precisely, each step goes through the process below; a condensed code sketch follows the list.
A against not A:

1. Let 0 denote “not A.”

2. Define a new set of labels for the training data, YAtraining, which contains only A’s and 0’s.

3. Train the SVM on {Xtraining, YAtraining}.

4. Compare the predictions of the learned decision rule with YAtest, which contains only A’s and 0’s.

5. Remove everything classified as an A in this binary decision step.

6. Move on to the L against not L classification.
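A condensed sketch of one such binary step using the e1071 package is given below. Xtraining, Ytraining, and Xtest are the objects described above; the helper name binary_step is illustrative, and this is not the exact code of [3].

library(e1071)

# One binary step: relabel everything that is not `key` as "0", train a
# radial-kernel C-classification SVM, and predict on the test features.
binary_step <- function(key, Xtrain, Ytrain, Xtest) {
  y <- factor(ifelse(Ytrain == key, key, "0"))
  fit <- svm(Xtrain, y, type = "C-classification", kernel = "radial")
  predict(fit, Xtest)
}

# Cascade: classify A first; whatever is not claimed as A moves on to L
predA <- binary_step("A", Xtraining, Ytraining, Xtest)
keep  <- predA == "0"
predL <- binary_step("L", Xtraining, Ytraining, Xtest[keep, , drop = FALSE])
# ... and similarly for Q, with P assigned to whatever remains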
The following output is for the A against 0 binary classification step with the initial
values of the parameters.
Predicted Labels:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 A A A A A A A A A 0 0 A 0 0 A A
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
A A A A A A 0 A A A 0 A 0 A A A A
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
A 0 A A A A A A 0 A A A A A A A 0
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 0 0
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
0 0 0 A 0 0 0 0 0 0 0 0 0 0 0 0 0
86 87 88 89 90
0 0 0 0 0
True Labels:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 A A 0 A 0 A A A 0 0 0 0 0 0 A A
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
A A A 0 A A 0 A A A 0 A 0 0 A A A
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
A 0 A 0 0 A 0 A 0 A 0 0 0 0 0 0 0
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
86 87 88 89 90
0 0 0 0 0
Summary of results:
true
predicted 0 A
0 48 0
A 17 25
The algorithm guessed 25 A’s and 48 0’s correctly, while it incorrectly classified 17
0’s as A’s.
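As a quick check, this step’s accuracy can be computed from the table in base R; the counts below are taken from the summary above.

# Confusion table for the A-against-0 step; accuracy = correct / total
tab <- matrix(c(48, 17, 0, 25), nrow = 2,
              dimnames = list(predicted = c("0", "A"), true = c("0", "A")))
sum(diag(tab)) / sum(tab)   # (48 + 25) / 90, about 0.81 for this step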
The total accuracy of the algorithm is defined as the percentage of total keys correctly predicted. As stated previously, the highest accuracy reached using MFCCs was 65.56%.
8 Structure Imposed by MFCCs

Humans are not very good at distinguishing keyboard strokes. The whole point of MFCCs is to emulate the human auditory filtration process in order to extract the key features of the data. This only truly makes sense if humans have been shown to discriminate well for the data set in question, i.e., if the features extracted by humans have been shown to work well for discrimination. MFCCs may be very appropriate for problems involving speech recognition, but it is not immediately clear that they are the best choice for the keyboard classification problem.
On a more fundamental level, the steepness of the intensity function of the Mel scale determines the shape of the bands of frequencies whose energies are summed together. This intensity function is thus actually imposing a structure on the characteristic features that can be observed. The intensity function rises sharply over low frequencies and then slowly flattens out. This results in thin critical bands at low frequencies and wide bands at high frequencies.
Figure 6: Mel Scale and Incompatible True Characteristics [6]
Suppose that the true characteristic features of the data which allow discrimination are not concentrated at low frequencies. An algorithm that combines frequencies into critical bands defined by the intensity function above might group them together. See Figure 6 for an illustration; the characteristic features that are useful for discrimination are labeled by x’s. Only the joint effects of characteristic frequencies belonging to the same band are maintained. An intensity function which is steep for high frequencies and relatively flat for low frequencies would yield critical bands which are more compatible with the data in Figure 6. Using the same number of critical bands, such a function could maintain the individual effects of more of these features.
This is just one example of a possible distribution of features. The idea behind
NMFCCs is that it is better to allow the data to tell you where its characteristic
features are, rather than presuming their density over the frequency domain a priori.
9 Construction of the NMFCCs

NMFCCs are for the most part identical to MFCCs. The key difference is that their intensity function is modeled to fit the types of sounds being analyzed, as opposed to being based on the principles of human physiology. As with the traditional MFCCs, the derivation of the NMFCCs begins with the STFT of the data, which maps the data from the time-amplitude domain into the time-frequency domain. The goal is to obtain a new intensity function that better represents the structure of the characteristic features. To do this, the frequencies obtained from the Fourier Transform (FT) were sorted from least to greatest. This removes time-ordering information from the intensity function.
Recall that the STFTs are composed with the ordered scale, thus summing along respective critical bands and recovering at least some of the time ordering:

[STFTs] × [Scaling]
Since the frequencies which make up a key stroke may not be the same each time it is pressed, the frequency decompositions for many keys were combined to get a better model for an average key hit. As a result, the representation obtained for the new intensity function was large; for the problem of key discrimination, it was characterized by approximately 90,000 entries. In order to construct the new [Scaling] matrix, new frequencies, not necessarily among the original 90,000 characteristics, need to be mapped by the new intensity function. To achieve this, a parametric model for the intensity function was created.

First, the frequency decomposition for a collection of keys was stored in F. The data was sorted from least to greatest within each STFT window, and then from least to greatest for each position over all windows. The result was that the columns contain the nth smallest entry for each window, sorted from least to greatest.
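In R, this double sort can be written as two applications of apply. A minimal sketch, with the matrix called Fmat here (rows are STFT windows) to avoid masking R’s built-in F:

# Sort within each STFT window (each row), then sort each position across
# windows (each column), so column n holds the n-th smallest entries of
# all windows in increasing order.
Fs <- t(apply(Fmat, 1, sort))   # row-wise sort; t() keeps windows as rows
Fs <- apply(Fs, 2, sort)        # column-wise sort across windows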
Figure 7: 1st smallest
Figure 8: 25th smallest
Figure 9: 50th smallest
These sorted columns are initially compared with exponentials centered about the mean of the domain. The residuals from this regression are used to help define the next set of parameters which will be introduced. In greater detail:

1. Define T to be a sequence starting from 1.

2. Find the optimal β0 for frequency = β0 · e^(0.01(T−µ)), subject to standard regression assumptions.
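Because this model is linear in β0, the fit in step 2 can be done with lm and a no-intercept formula. A minimal sketch, where freq stands for one sorted column of the frequency data:

# Fit frequency = beta0 * exp(0.01 * (T - mu)) for one sorted column
T.seq <- seq_along(freq)        # T: a sequence starting from 1
mu    <- mean(T.seq)            # exponential centered at the domain mean
basis <- exp(0.01 * (T.seq - mu))
fit   <- lm(freq ~ 0 + basis)   # no intercept, so only beta0 is estimated
beta0 <- coef(fit)[["basis"]]
res   <- residuals(fit)         # inspected to locate the spline cutoffs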
Figure 10: Residuals for 1st
Figure 11: Residuals for 25th
Figure 12: Residuals for 50th
The data points after the minimum residual occurs are not considered. It is possible that there could be important information in these few final points, but they do not follow the same form as the rest of the data. Instead of adding more parameters to the model to explain them sufficiently, they were thrown out. Exploring the amount of information contained in these points is an area which further research could investigate.

The locations of the maximum and minimum residuals from the above regressions were used to define the cutoffs for a spline model which is exponential over the first region and the sum of an exponential and a line over the second region. Some examples of the residuals after the separation of the exponential and the introduction of the linear component are given below:
Figure 13: Residuals for 1st
Figure 14: Residuals for 25th
Figure 15: Residuals for 50th
The fully sorted data is shown in Figure 16. The information gained from the aforementioned regressions will be used to help create a parametric model.

Figure 16: Combined Sorted Frequencies

These regressions were applied on a larger scale to the combined data set. The information gained about the residuals was used to search more efficiently for the optimal cutoff points for the spline. The functions defined on this region also depend on the indices at which the maximum and minimum residuals were found in the preliminary steps.
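The regressors y2f, ys2f, and linf in the summary below can be read as piecewise basis functions: an exponential over the whole range, a second exponential that becomes active beyond a cutoff, and a linear term over the first region. Their exact definitions are not reproduced here, so the construction below is only a hypothetical sketch; K1 (the spline cutoff index) and K2 are assumed to come from the residual analysis above.

# Hypothetical construction of the spline regressors; the names follow
# the lm call below, but these definitions are assumptions.
T.seq <- seq_len(K2)
y2f   <- exp(0.01 * (T.seq - mean(T.seq)))                # global exponential
ys2f  <- ifelse(T.seq > K1, exp(0.01 * (T.seq - K1)), 0)  # active past cutoff
linf  <- pmin(T.seq, K1)                                  # linear, then flat
fit   <- lm(AllscfL[1:K2] ~ y2f + ys2f + linf)
summary(fit)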
Summary of the Regression applied to the Combined Sorted Data
Call:
lm(formula = AllscfL[1:K2] ~ y2f + ys2f + linf)
Residuals:
Min 1Q Median 3Q Max
-87988.3 -531.9 107.1 792.7 44758.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.488e+02 1.267e+01 59.11 <2e-16 ***
y2f 7.152e+02 7.101e-01 1007.16 <2e-16 ***
ys2f 1.586e-183 0.000e+00 Inf <2e-16 ***
linf -7.222e+02 2.454e+00 -294.26 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3267 on 86773 degrees of freedom
Multiple R-squared: 0.9473, Adjusted R-squared: 0.9473
F-statistic: 5.199e+05 on 3 and 86773 DF, p-value: < 2.2e-16
This will serve as the NMFCC intensity function for the four key classification problem.
10 Performance & Conclusion

The NMFCCs were used in [3]’s algorithm in lieu of the MFCCs. Without increasing the number of coefficients, the accuracy improved by approximately 30.51% relative to the baseline, from 65.56% to 85.56%. The R output below is for the initial values of the parameters, using the new NMFCCs.
Results from the First Binary Classification:
true
predicted 0 A
0 58 2
A 7 23
Results from the Second Binary Classification:
true
predicted 0 L
0 40 0
L 0 20
Results from the Third Binary Classification:
true
predicted 0 Q
0 12 0
Q 4 24
Results from classifying P as what remains:
true
predicted 0 P
0 0 0
P 2 10
The algorithm guesses correctly for 23 + 20 + 24 + 10 = 77 keys out of a total of 90 in the test set, for an accuracy of 77/90 ≈ 85.56%.
There are many aspects of NMFCCs which remain to be investigated. The approximation for the intensity function which was used still had very large errors in some places, but nevertheless performed well in classification. NMFCCs are constructed from the prevalence of certain frequencies in a signal, but this may not be truly representative of the characteristic frequencies which allow for discrimination. Intensity functions which are defined by the variability in the prevalence of frequencies in certain areas may perform better.
In this case, it is clear that an objective intensity function yields better features for discrimination. For speech recognition problems, MFCCs are logically very good features to use; however, to make these results more general, future research could look into the classes of problems for which NMFCCs and objective intensity functions provide a significant improvement in the extracted features.
11 tuneR MFCC Code

The following code is taken from the MFCC function in the tuneR package for R. It supplements the traditional approach presented in the section “Construction of the MFCCs.”
> MFCC
function (object, a = 0.1, HW.width = 0.025, HW.overlapping = 0.25,
    T.number = 24, T.overlapping = 0.5, K = 12)
{
    if (!is(object, "Wave"))
        stop("'object' needs to be of class 'Wave'")
    validObject(object)
    if (object@stereo)
        stop("Stereo processing not yet implemented...")
    ## Pre-emphasis: filter the raw samples with coefficients (1, -a)
    S <- object@left
    sampling.rate <- object@samp.rate
    S.length <- length(S)
    S.fil <- filter(S, filter = c(1, -a))
    S.fil <- S.fil[-S.length]
    ## Short-time Fourier transform over Hamming windows ([STFTs])
    HW.length <- sampling.rate * HW.width
    HW.shift <- HW.length * (1 - HW.overlapping)
    coefficients <- floor(S.length/2) + 1
    STDFT <- stftMFCC(S.fil, win = HW.length, inc = HW.shift,
        coef = coefficients, wtype = hamming.window)
    HW.number <- dim(STDFT$values)[1]
    time <- (STDFT$center - 1)/(sampling.rate - 1)
    frequency <- (0:(coefficients - 1))/(2 * coefficients) *
        sampling.rate
    ## Map the frequency axis onto the Mel scale and build the
    ## triangular filter matrix ([Scaling], named T.windows here)
    Mel <- 2595 * log10(1 + frequency/700)
    T.length <- (max(Mel) - min(Mel))/(T.number * (1 - T.overlapping) +
        T.overlapping)
    T.centers <- min(Mel) + 0.5 * T.length + (0:(T.number - 1)) *
        (1 - T.overlapping) * T.length
    T.windows <- matrix(0, coefficients, T.number)
    for (m in 1:T.number) {
        T.windows[, m] <- 1 - (abs(Mel - T.centers[m]))/(0.5 *
            T.length)
        T.windows[T.windows[, m] < 0, m] <- 0
    }
    ## Inverse discrete cosine transform matrix ([IDCT])
    IDCT <- sapply(1:K, function(x) cos(x * (1:T.number - 0.5) *
        pi/T.number))
    ## Combine band energies, take the log, and apply the IDCT
    X <- log(abs(STDFT$values %*% T.windows)) %*% IDCT
    X <- cbind(time, X)
    return(X)
}
<environment: namespace:tuneR>
References
[1] Asonov, Dmitri, and Agrawal, Rakesh, “Keyboard Acoustic Emanations,” IBM Almaden Research Center, pp. 1-9, 2004.
[2] Zhuang, Li, Zhou, Feng, and Tygar, J. D., “Keyboard Acoustic Emanations Revisited,” Proceedings of the 12th ACM Conference on Computer and Communications Security, pp. 373-382, New York: ACM, Nov. 2005.
[3] Lester, Matthew D., “Sound Classification,” Thesis, University at Albany, State University of New York, 2010. Print.
[4] Jensen, Jesper Højvang, and Christensen, Mads Græsbøll, “Evaluation Of MFCC Estimation Techniques For Music Similarity,” “http://kom.aau.dk/~jhj/files/jensen06mfcc.pdf.”
[5] Jensen, Jesper Højvang, and Christensen, Mads Græsbøll, “Quantitative Analysis of a Common Audio Similarity Measure,” “http://www.ee.columbia.edu/~dpwe/pubs/JensCEJ09-quantmfcc.pdf.”
[6] Camastra, Francesco, and Vinciarelli, Alessandro, “Machine Learning for Audio, Image and Video Analysis: Theory and Applications,” London: Springer, 2008. Print.
[7] “Fourier Transform,” Wikipedia, the Free Encyclopedia, “http://en.wikipedia.org/wiki/Fourier_transform.”
[8] “Discrete Cosine Transform,” Wikipedia, the Free Encyclopedia, “http://en.wikipedia.org/wiki/Discrete_cosine_transform.”
[9] “Mel-frequency cepstrum,” Wikipedia, the Free Encyclopedia, “http://en.wikipedia.org/wiki/Mel-frequency_cepstral_coefficient.”
[10] Encyclopaedia Britannica. Chicago: Encyclopaedia Britannica, 1997. Print.
[11] Molau, Sirko, and Pitz, Michael, “Computing Mel-Frequency Cepstral Coefficients On The Power Spectrum,” “http://acccn.net/cr569/Rstuff/keys/mfccs.pdf.”
[12] Quatieri, T. F., “Discrete-Time Speech Signal Processing: Principles and Practice,” PHI, 2002. Print.