IMPROVEMENT OF AUDIO FEATURE EXTRACTION TECHNIQUES IN
TRADITIONAL INDIAN STRING MUSICAL INSTRUMENT
KOHSHELAN A/L SUNDARARAJOO
A thesis submitted in
fulfillment of the requirement for the award of the
Degree of Master of Information Technology
Faculty of Computer Science and Information Technology
Universiti Tun Hussein Onn Malaysia
AUGUST 2015
ABSTRACT

Audio feature extraction is an essential process in which audio features carrying relevant audio information are extracted from audio files. One of the important roles played by these features is to improve classification accuracy. However, the presence of noise in the audio signal degrades the quality of the extracted features and may result in low classification accuracy. Some of the existing audio feature extraction techniques are Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC), Local Discriminant Bases (LDB), Zero-Crossing Rate (ZCR) and Perceptual Linear Prediction (PLP); of these, the three most frequently used are MFCC, LPC and ZCR. Previous research has noted the shortcomings of these three techniques when extracting noisy signals. This has been identified in the case of traditional Indian musical instruments, where the vibration of the strings produces noise at the highest amplitude. Therefore, the Zero Forcing Equalizer (ZFE) was proposed to equalize the noise at the highest amplitude. ZFE was integrated with the three audio feature extraction techniques, yielding MFCC-ZFE, LPC-ZFE and ZCR-ZFE, in order to improve the performance of the existing techniques. The results show that the proposed MFCC-ZFE achieved the best classification accuracy of 98.2%, an improvement of 4.0%, using kNN. Meanwhile, the combined features of MFCC-ZFE + LPC-ZFE + ZCR-ZFE obtained a classification accuracy of 98.3%, an improvement of 9.1%, using kNN.
ABSTRAK

Audio feature extraction is an important process in which the audio features extracted from audio files contain relevant information. One of the important roles played by audio features is to improve audio classification performance. However, the presence of noise in the audio signal reduces the quality of the extracted features and lowers audio classification performance. Some of the existing audio feature extraction techniques are Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC), Local Discriminant Bases (LDB), Zero-Crossing Rate (ZCR) and Perceptual Linear Prediction (PLP), but the three most frequently used techniques are MFCC, LPC and ZCR. Previous studies have identified weaknesses in these three techniques caused by the vibration of traditional Indian musical instruments, particularly string instruments, which produce noise at the highest amplitude. Therefore, the Zero Forcing Equalizer (ZFE) is proposed to equalize the noise at the highest amplitude. In this study, ZFE is integrated with the three audio feature extraction techniques, namely MFCC-ZFE, LPC-ZFE and ZCR-ZFE, to improve the performance of the existing techniques. The results show that the best improvement in audio classification accuracy was obtained by MFCC-ZFE, with 98.2% classification accuracy and a 4.0% improvement using kNN. In addition, the combined audio features of MFCC-ZFE + LPC-ZFE + ZCR-ZFE obtained 98.3% classification accuracy with a 9.1% improvement using kNN.
CONTENTS

TITLE
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
PUBLICATION
ABSTRACT
ABSTRAK
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF APPENDICES

CHAPTER 1 INTRODUCTION
1.1 Background of Study
1.2 Research Motivation
1.3 Objectives
1.4 Scope of Research
1.5 Importance of Research
1.6 Thesis Outlines

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Audio Feature Extraction Concept
2.3 An Overview of Audio Feature Extraction Techniques
2.3.1 Mel Frequency Cepstral Coefficients (MFCCs)
2.3.2 Linear Predictive Coding (LPC)
2.3.3 Zero-Crossing Rate (ZCR)
2.4 Noise in Audio Signal
2.5 Noise Filtering
2.6 Sound Equalizer
2.6.1 Zero Forcing Equalizer (ZFE)
2.6.2 Minimum Mean Square Error (MMSE) Equalizer
2.7 Audio Classification Techniques Overview
2.7.1 K-Nearest Neighbor (kNN) Classifier
2.7.2 Bayesian Network (BNs) Classifier
2.7.3 Support Vector Machine (SVM) Classifier
2.7.4 C4.5 Decision Tree Classifier
2.7.5 Naïve Bayes (NB) Classifier
2.8 An Overview of Instrument Sounds
2.9 Traditional Indian Musical Instrument
2.10 String Instrument
2.11 Carnatic Music
2.12 Veena
2.13 Chapter Summary

CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Research Methodology
3.2.1 Audio Acquisition
3.2.2 Audio Pre-processing
3.2.3 Audio Feature Extraction
3.2.3.1 The Proposed Technique of MFCC-ZFE
3.2.3.2 The Proposed Technique of LPC-ZFE
3.2.3.3 The Proposed Technique of ZCR-ZFE
3.2.4 Audio Classification
3.2.5 Performance Measurement
3.3 Chapter Summary

CHAPTER 4 ANALYSIS AND RESULT
4.1 Introduction
4.2 Performance of Audio Feature Extraction Techniques
4.3 ZFE and MMSE Equalizer Performance Comparison
4.4 Chapter Summary

CHAPTER 5 CONCLUSION
5.1 Research Summary
5.2 Research Contributions
5.3 Research Advantages
5.4 Future Works
5.5 Conclusion

REFERENCES
APPENDIX
LIST OF TABLES

2.1 Comparison of audio feature extraction techniques used by authors
2.2 Descriptions of five instruments from the class of Veena
3.1 Audio File Properties
3.2 Experimental Parameters for MFCC, LPC and ZCR
3.3 The original Pre-emphasis algorithm
3.4 The modified Pre-emphasis algorithm
3.5 The original LPC Synthesizer algorithm
3.6 The modified LPC Synthesizer algorithm
3.7 The original ZCR energy calculation algorithm
3.8 The modified ZCR energy calculation algorithm
4.1 A comparison of the audio classification accuracy between the original and the proposed techniques in five different classifiers
4.2 A comparison of the audio classification accuracy by combining the audio features in five different classifiers
4.3 A comparison between classification accuracy and RMSE for the proposed techniques
4.4 A comparison between classification accuracy, F-Measure and MAE for the proposed techniques
4.5 A comparison between the ZFE and MMSE equalizer
4.6 Confusion matrices for MFCC in five different classifiers
4.7 Confusion matrices for MFCC-ZFE in five different classifiers
4.8 Confusion matrices for LPC in five different classifiers
4.9 Confusion matrices for LPC-ZFE in five different classifiers
4.10 Confusion matrices for ZCR in five different classifiers
4.11 Confusion matrices for ZCR-ZFE in five different classifiers
LIST OF FIGURES

2.1 MFCC Block diagram
2.2 Spectrogram diagram
2.3 LPC Block diagram
2.4 ZCR Block diagram
2.5 Audio equalizer
2.6 Graphical model of BNs
2.7 Hierarchy of Traditional Indian Musical Instruments
3.1 Research Methodology
3.2 Research Framework
3.3 Audio signal cropped to 5 seconds
3.4 Spectrogram
3.5 Modified Pre-emphasis block diagram
3.6 Spectrogram for original MFCC
3.7 Spectrogram for MFCC-ZFE
3.8 Modified LPC Synthesizer block diagram
3.9 Spectrogram for original LPC
3.10 Spectrogram for LPC-ZFE
3.11 Modified ZCR energy calculation block diagram
3.12 Spectrogram for original ZCR
3.13 Spectrogram for ZCR-ZFE
LIST OF SYMBOLS AND ABBREVIATIONS
f - Frequency
dB - Decibel
kHz - Kilohertz
MFCC - Mel Frequency Cepstral Coefficient
LPC - Linear Predictive Coding
ZCR - Zero Crossing Rate
kNN - k-Nearest Neighbor
BNs - Bayesian Network
SVM - Support Vector Machine
SMO - Sequential Minimal Optimization
NB - Naïve Bayes
ISI - Inter-Symbol Interference
ZFE - Zero Forcing Equalizer
MIMO - Multiple Input Multiple Output
MLSE - Maximum Likelihood Sequence Estimator
CBMIR - Content-based Music Information Retrieval
FFT - Fast Fourier Transform
RMSE - Root Mean Square Error
MAE - Mean Absolute Error
LIST OF APPENDICES

A Classification averages for original and proposed techniques
B Confusion matrices for combined features in five different classifiers
CHAPTER 1
INTRODUCTION
1.1 Background of Study

Audio feature extraction is a process that obtains a set of audio features from an audio signal; this rich set of features can then be used computationally, for example to determine average, maximum or frequency values that can be plotted on a spectrogram (Dave, 2013; Bullock, 2008). There are two types of extracted features: low-level features and high-level features (Lerch, 2012; Furht, 2009). Low-level features represent terms in which humans refer to music, such as pitch, tempo and amplitude. High-level features are considered directly (humanly) interpretable and are derived from the low-level features; examples include genre and style. This research focuses on the low-level features, since the audio features are extracted from the high-amplitude portion of the signal. Audio feature extraction can contribute to better audio classification accuracy, depending on the quality of the extracted features (Umapathy, Ghoraani & Krishnan, 2010). However, one of the essential drawbacks in the audio feature extraction process is the presence of disturbances, such as noise in the high-amplitude signal produced by certain instruments, which may degrade the quality of the extracted features and lead to low classification accuracy (Stulov & Kartofelev, 2014; Dave, 2013; Wolf & Nadeu, 2008; Subramanian, 2006). There are many techniques for audio feature extraction, such as Mel-Frequency Cepstral Coefficients (MFCC), Local Discriminant Bases (LDB), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP) and Zero-Crossing Rate (ZCR) (Dave, 2013; Anusuya & Katti, 2011; Umapathy, Krishnan & Rao, 2007). In this research, three audio feature extraction techniques were selected, namely MFCC, LPC and ZCR, based on their good performance and their frequent use by previous researchers (Chougule & Chavan, 2014; Bormane & Dusane, 2013; Anusuya & Katti, 2011; Kumari, Kumar & Solanki, 2010; Gunasekaran & Revathy, 2008).
MFCC is one of the techniques commonly used in digital signal processing (Weeks, 2010) and has been proven to be one of the most effective techniques in audio feature extraction (Keronen et al., 2011; Furht, 2009). Similar to MFCC, LPC is another technique which offers a powerful yet simple method to extract audio information (Elminir, ElSoud & El-Maged, 2012; Sheetal & Raut, 2012). On the other hand, ZCR is useful for musical instrument measurement and endpoint detection (detection of the start and end of unvoiced sounds) (Khan, Bhaiya & Banchhor, 2012). Even though these three techniques have performed well in extracting audio features, they have shown shortcomings in extracting noisy signals, especially from string instruments in the domain of traditional Indian musical instruments (Anusuya & Katti, 2011; Chougule & Chavan, 2014; Bormane & Dusane, 2013).

Traditional Indian musical instruments are among the oldest musical instruments in the world (The Incredible India Travel, 2011) and have contributed immensely to making Indian music famous around the world. Traditional Indian musical instruments can be categorized into four types: string instruments, percussion instruments, wind-blown instruments and solid instruments (Gunasekaran & Revathy, 2008). Specifically, in the context of content-based audio extraction and audio classification, these instruments have been used many times to identify the characteristics of a single instrument, such as pitch or amplitude, among multiple instruments (Gunasekaran & Revathy, 2008). However, it is difficult to identify the characteristics of a particular instrument, such as a string instrument, since each instrument produces sound in its own way. Furthermore, these instruments may contribute noisy signals due to disturbances from indoors, outdoors or the instrument itself. Among them, string instruments such as the Veena contribute more noise due to the sound produced by the vibration of the strings (Dave, 2013; Stulov & Kartofelev, 2014; Subramanian, 2006). In this research, string instruments of the traditional Indian musical instrument family were used, since they contribute more noise that comes from the highest amplitude of the audio signal and leads to low quality of the extracted audio features.
Therefore, the Zero Forcing Equalizer (ZFE) was proposed in this research to be integrated with the three audio feature extraction techniques, yielding MFCC-ZFE, LPC-ZFE and ZCR-ZFE, to equalize the noise which comes from the highest amplitude. ZFE is a linear receiver (also known as an equalizer) used in communication systems (Kaur & Kansal, 2013). The function of ZFE is to invert the frequency response of the channel, which gives it the ability to equalize the noise that comes from the highest amplitude of the audio signal (Kaur & Kansal, 2013). ZFE is integrated with the audio feature extraction techniques to overcome the shortcomings of the existing techniques and to improve the quality of the extracted features. Furthermore, another linear equalizer, known as the Minimum Mean Square Error (MMSE) equalizer, is also integrated with the three audio feature extraction techniques for performance comparison.

A good audio feature extraction technique leads to better audio classification accuracy. In order to evaluate the performance of the proposed audio feature extraction techniques, five benchmark classifiers were selected: K-Nearest Neighbor (kNN), Bayesian Network (BNs), Support Vector Machine (SVM), C4.5 decision tree and Naïve Bayes (NB). The classifiers were selected based on their good performance in audio classification (Kumar, Pandya & Jawahar, 2014; Bello, 2013; Rocha, Panda & Paiva, 2013; Witten, Frank & Hall, 2011; Nettleton, Orriols-Puig & Fornells, 2010; Li, Ogihara & Li, 2003). The classification results are compared to show the performance of the extracted audio features.
1.2 Research Motivation
As the need for reliable information from audio and music grows, the importance of research on audio feature extraction increases. Audio feature extraction has contributed immensely to various fields, such as data mining involving Content-Based Music Information Retrieval (CBMIR). CBMIR has become a critical research topic and has been given increasing attention in recent years due to the extensive growth in audio and music (Yu et al., 2013). CBMIR generally involves analyzing, searching and retrieving music based on the audio features of a recording, which are normally used to represent songs or music genres. Identifying them would normally involve feature extraction and classification tasks. Theoretically, the more features analyzed, the better the classification accuracy that can be achieved.

The impact of audio feature extraction on audio classification is large, since audio classification accuracy is determined by the quality of the extracted audio features. Good quality extracted audio features may contribute to better audio classification accuracy (Umapathy, Ghoraani & Krishnan, 2010). However, the quality of the audio features depends upon the behavior of the audio domain. Traditional Indian musical instruments are among the oldest in the world, yet not much work has been done on feature extraction for them compared to Western music (Agarwal, Karnick & Raj, 2013). In previous research, traditional Indian string instruments showed fluctuating behavior in their audio signals across different experimental setups, as identified using MFCC, LPC and ZCR (Gunasekaran & Revathy, 2008; Chougule & Chavan, 2014). This fluctuating behavior was attributed to the vibration of the strings, which produces unwanted sound (noise) at the highest amplitude (Dave, 2013; Stulov & Kartofelev, 2014; Subramanian, 2006). Wolf and Nadeu (2008) also reported that MFCC performance degrades severely when the extracted features contain noise. Moreover, Anusuya and Katti (2011) mentioned that if the audio signal used is noisy, the features extracted by MFCC, LPC and ZCR lead to lower classification accuracy.

The findings of previous research showed that the existing audio feature extraction techniques are unable to produce high-quality audio features due to the presence of noise in the highest amplitude of the audio signal (Stulov & Kartofelev, 2014; Dave, 2013; Wolf & Nadeu, 2008; Subramanian, 2006). This shortcoming has inspired the use of ZFE to equalize the noise at the highest amplitude (Kaur & Kansal, 2013). Therefore, in this research, ZFE was integrated with MFCC, LPC and ZCR to overcome the drawback of the audio feature extraction techniques in extracting noisy audio signals at the highest amplitude. This research posits that the audio feature extraction techniques can be improved to overcome this drawback and produce better quality extracted audio features.
1.3 Objectives
This study embarks on the following objectives:
i. To propose three (3) improved techniques of Mel-Frequency Cepstral Coefficient
(MFCC), Linear Predictive Coding (LPC) and Zero-Crossing Rate (ZCR) by integrating
them with Zero Forcing Equalizer (ZFE);
ii. To implement the proposed techniques in (i) on traditional Indian string musical
instruments; and
iii. To evaluate the performance of the proposed audio feature extraction techniques based on
audio classification accuracy by using five classifiers which are k-Nearest Neighbor
(kNN), Bayesian Network (BNs), Support Vector Machine (SVM), C4.5 decision tree
and Naïve Bayes (NB).
1.4 Scope of Research
The scope of the research is divided into four parts: audio acquisition, audio pre-processing, audio feature extraction and audio classification. Audio acquisition is the process of collecting information on audio files, such as sampling rate and bit depth. The dataset of audio files of traditional Indian string musical instruments was collected from SoundCloud, the online music sharing platform (SoundCloud, 2007). There was a total of 500 audio files, as each instrument from the class of Veena contributed 100 audio files. Audio pre-processing is the process of modifying the audio signal based on user preferences. The audio files were cropped where the amplitude value is greater than or close to the maximum value of 0 dB (Zytrax, 2014). Meanwhile, audio feature extraction involves the process of extracting low-level features from the audio files (Lerch, 2012; Furht, 2009). Three audio feature extraction techniques were used: Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC) and Zero-Crossing Rate (ZCR). Audio classification, on the other hand, is the process of categorizing the audio features into a set of classes in order to obtain their classification accuracy using different classifiers. Five audio classifiers were used: k-Nearest Neighbor (kNN), Bayesian Network (BNs), Support Vector Machine (SVM), C4.5 decision tree and Naïve Bayes (NB). In addition, in order to validate the performance of the ZFE, a similar equalizer known as the Minimum Mean Square Error (MMSE) equalizer was also used in this research (Kumar and Kaur, 2012). The MMSE equalizer was integrated with the audio feature extraction techniques in the same way as ZFE, and the performance of ZFE and MMSE is compared.
1.5 Importance of Research
The proposed techniques are important because they can be utilized in many areas or fields involving soft computing. By studying the techniques used in the audio feature extraction process, the proposed techniques are expected to provide better audio classification accuracy when extracting audio signals that contain noise at the highest amplitude. Therefore, this research aims to solve the problem of audio feature extraction techniques in the domain of traditional Indian instruments.
1.6 Thesis Outlines
The remainder of the thesis is structured as follows:

Chapter 2 provides a detailed explanation of the audio feature extraction techniques used in the research, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) and Zero-Crossing Rate (ZCR). Moreover, this chapter also discusses the sound equalizers, namely the Zero Forcing Equalizer (ZFE) and the Minimum Mean Square Error (MMSE) equalizer. The benchmark classifiers, k-Nearest Neighbor (kNN), Bayesian Network (BNs), Support Vector Machine (SVM), C4.5 decision tree and Naïve Bayes (NB), which are used to evaluate the performance of the audio feature extraction techniques, are also described. In addition, the background of traditional Indian musical instruments is presented in this chapter.

Chapter 3 outlines the phases involved in the methodology. The methodology consists of four phases: audio acquisition, audio pre-processing, the proposed audio feature extraction techniques and audio classification. This chapter also explains the performance measurements in terms of classification accuracy, Root Mean Square Error (RMSE), F-Measure and Mean Absolute Error (MAE).

Chapter 4 presents the performance evaluation results based on the classification accuracy, RMSE, F-Measure and MAE of the features extracted by the audio feature extraction techniques. In this chapter, comparisons between the original and the proposed techniques are presented, along with a comparison between the ZFE and MMSE equalizers and a detailed explanation of the confusion matrices.

Chapter 5 concludes and highlights the research findings. This chapter summarizes the research outcomes, discusses the advantages of the research, and presents the future work that can be done to further enhance the research.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter outlines the techniques used in audio feature extraction and audio classification and provides a detailed description of each technique. Furthermore, this chapter explains the sound equalizers used in the research and introduces the domain of traditional Indian string musical instruments.
2.2 Audio Feature Extraction Concept
Audio feature extraction is a process that transforms audio data into a set of features such as pitch, timbre and others (Bormane & Dusane, 2013; Bullock, 2008). Specifically, the audio feature extraction process addresses the analysis and extraction of meaningful information from audio signals. The objective of the audio feature extraction process is to capture the relevant information in the audio signal in order to obtain a higher-level understanding of it (Dave, 2013; Bullock, 2008). Furthermore, the extracted features may provide a higher-level understanding of the amplitude or frequency components of the audio signal when plotted on a spectrogram (Dave, 2013; Bullock, 2008). Previously, audio features have been obtained using different types of audio feature extraction techniques, such as Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Coding (LPC) and Zero-Crossing Rate (ZCR), for musical instrument recognition describing various sound qualities (Bormane & Dusane, 2013; Umapathy, Krishnan & Rao, 2007). However, according to Kumari, Kumar and Solanki (2010), audio features become quite hard to extract robustly when dealing with musical phrases such as bass-lines, percussion loops and others. Therefore, choosing the right feature extraction technique is crucial. In addition, music feature extraction is a wide research area, and improving feature extraction will most likely have a major impact on the performance of an instrument classification system (Gunasekaran & Revathy, 2008). The next section provides detailed descriptions of the chosen audio feature extraction techniques.
2.3 An Overview of Audio Feature Extraction Techniques
Many techniques have been used in audio feature extraction, such as Mel-Frequency Cepstral Coefficients (MFCC), Local Discriminant Bases (LDB), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP) and Zero-Crossing Rate (ZCR) (Dave, 2013; Anusuya & Katti, 2011; Umapathy, Krishnan & Rao, 2007). These techniques have their own strengths and drawbacks in extracting audio signals (Dave, 2013; Keronen et al., 2011; Ngo, 2011; Patil et al., 2012; Sheetal & Raut, 2012; Umapathy, Krishnan & Rao, 2007). Table 2.1 compares the audio feature extraction techniques used by researchers. For example, MFCC has proven to be one of the most effective audio feature extraction techniques, giving better performance especially in audio recognition rate (Kumar, Pandya & Jawahar, 2014; Xie, Cao & He, 2012; Keronen et al., 2011; Furht, 2009). According to Chachada and Kuo (2014), MFCC features have performed better than LDB features in the domain of artificial (instrumental) and natural sounds. In another study, Dave (2013) used the MFCC, LPC and PLP feature extraction techniques and stated that, compared to MFCC and LPC, PLP is more useful in the domain of speech signals, since PLP can discard irrelevant information in speech audio signals and thereby improve the speech recognition rate. Furthermore, compared to LDB and PLP, techniques such as MFCC, LPC and ZCR have been applied many times by previous researchers in the domain of musical instruments, specifically traditional Indian musical instruments (Chougule & Chavan, 2014; Bormane & Dusane, 2013; Anusuya & Katti, 2011; Kumari, Kumar & Solanki, 2010; Gunasekaran & Revathy, 2008). Even though MFCC, LPC and ZCR have shown good performance, previous research has pointed out the drawback of these three established techniques in extracting noisy signals at the highest amplitude, which degrades the quality of the extracted features and may lead to low classification accuracy (Stulov & Kartofelev, 2014; Dave, 2013; Wolf & Nadeu, 2008; Subramanian, 2006). Detailed explanations of the techniques are given in the next subsections.
Table 2.1: Comparison of audio feature extraction techniques used by authors

Chachada and Kuo (2014)
Feature extraction techniques: MFCC; Local Discriminant Bases (LDB)
Classifiers: K-Nearest Neighbor (kNN); Support Vector Machines (SVM)
Data: 37 classes of artificial (instrumental) and natural sounds (e.g., bells, clapping, thunder)
Result: MFCC features performed better than LDB features using kNN in the domain of artificial (instrumental) and natural sounds

Kumar, Pandya and Jawahar (2014)
Feature extraction techniques: MFCC
Classifiers: K-Nearest Neighbor (kNN); Bayes Network (BNs); Support Vector Machines (SVM)
Data: 10 Indian ragas (melodies) consisting of 170 tunes
Result: In raga (melody) recognition, kNN and BNs achieved more than 70% classification accuracy and SVM achieved more than 80% classification accuracy using MFCC

Bormane and Dusane (2013)
Feature extraction techniques: MFCC; LPC; ZCR
Classifiers: Wavelet Packet Transform (WPT)
Data: 4 classes of musical instruments (e.g., sitar, piano, guitar)
Result: Family notes recognition rates of more than 60% for string instruments, more than 50% for keyboard instruments, more than 50% for woodwind instruments and more than 60% for brass instruments

Dave (2013)
Feature extraction techniques: MFCC; LPC; PLP
Classifiers: Support Vector Machine (SVM); Artificial Neural Network (ANN)
Data: Music and speech signals (discussion)
Result: Compared to MFCC and LPC, PLP is more useful in the domain of speech signals; ANN is preferable in the field of speech recognition

Rocha, Panda and Paiva (2013)
Feature extraction techniques: Standard audio features (SA: spectral features and MFCC); melodic audio features (MA)
Classifiers: K-Nearest Neighbor (kNN); Bayes Network (BNs); Support Vector Machines (SVM); C4.5 decision tree; Naïve Bayes (NB)
Data: 903 datasets of emotional music
Result: Best results of 40%-62% for BNs, 39%-48% for NB, 40%-60% for kNN, 34%-60% for C4.5 decision tree and 45%-64% for SVM

Patil (2012)
Feature extraction techniques: MFCC; LPC; ZCR; Short-Time Energy (STE); Spectral Features (SF)
Classifiers: Support Vector Machines (SVM)
Data: Humming sounds produced by two male speakers
Result: The fusion system of MFCC+ZCR+STE+SF gives 86.29% classification accuracy, compared to 77.71% for MFCC alone

Sheetal and Raut (2012)
Feature extraction techniques: LPC
Classifiers: Wavelet Packet Transform (WPT)
Data: Music and speech signals (discussion)
Result: LPC can remove the redundancy in the signal and has the highest rate of audio compression

Xie, Cao and He (2012)
Feature extraction techniques: MFCC; LPC; ZCR
Classifiers: Support Vector Machine (SVM)
Data: 40 sound samples of cutting down trees, sawing trees and trees collapsing
Result: Compared to LPC and ZCR, MFCC gives better performance in audio recognition, with a recognition rate of more than 80%

Anusuya and Katti (2011)
Feature extraction techniques: MFCC; LPC; PLP
Classifiers: Wavelet Packet Transform (WPT)
Data: 500 samples of clean and noisy Kannada audio signals
Result: Accuracy on clean/noisy signals of 70-90%/50-70% for MFCC, 70-80%/50-60% for LPC and 40-60%/60-70% for PLP

Keronen et al. (2011)
Feature extraction techniques: MFCC
Classifiers: Gaussian Mixture Model (GMM)
Data: Recorded audio signals containing noise from public places and car environments
Result: The performance of MFCC degraded due to the mismatch between the training and recognition noise environments

Ngo (2011)
Feature extraction techniques: LPC; ZCR
Classifiers: SVM
Data: Noisy audio signal captured from the microphone and output to the loudspeaker in hearing aids
Result: ZCR has a high signal frequency rate for unvoiced sound compared to voiced sound

Kumari, Kumar and Solanki (2010)
Feature extraction techniques: MFCC; Autocorrelation
Classifiers: Artificial Neural Network (ANN)
Data: 5 different types of North Indian musical instruments
Result: Accuracy of 61% (MFCC) and 79.3% (Autocorrelation) for flute, 77.0% (MFCC) and 80.7% (Autocorrelation) for dholak, 60% for sitar, 64.70% for bhapang and 72% for the instrument family

Gunasekaran and Revathy (2008)
Feature extraction techniques: MFCC
Classifiers: K-Nearest Neighbor (kNN); Multi-Layer Perceptron (MLP); Gaussian Mixture Model (GMM)
Data: 10 Indian musical instruments (e.g., Veena, Sitar, Indian flute)
Result: Confidence-based fusion results of 93.6% for kNN+MLP, 92.8% for kNN+GMM, 90.9% for MLP+GMM and 92.1% for kNN+MLP+GMM

Umapathy, Krishnan and Rao (2007)
Feature extraction techniques: Local Discriminant Bases (LDB); MFCC
Classifiers: Local Discriminant Bases (LDB)
Data: 213 artificial and natural sounds (e.g., drums, flute, animals)
Result: LDB and MFCC average accuracies of 91% at the first level, 99% at the second level and 95% at the third level
2.3.1 Mel Frequency Cepstral Coefficients (MFCCs)
Mel Frequency Cepstral Coefficients (MFCCs) are cepstral coefficients used to represent audio in a way that mimics the physiological properties of the human auditory system (Kumari, Kumar & Solanki, 2010; Sukor, 2012). The mel scale was developed based on studies of human auditory perception. MFCCs are commonly used in speech recognition and are increasingly used in music information retrieval and genre classification systems (Kumari, Kumar & Solanki, 2010). The MFCC technique has its own advantages and disadvantages in extracting the audio signal. According to Xie, Cao and He (2012), MFCC gives better performance in audio recognition rate. In addition, MFCC takes a short time to extract the features of an audio signal (Keronen et al., 2011). However, the drawback of MFCC is that it produces low-quality audio features, leading to low classification accuracy, if the signal used is noisy (Anusuya & Katti, 2011). Figure 2.1 shows the block diagram of MFCC.

Figure 2.1: MFCC Block diagram (Dave, 2013). [Stages: Audio Signal → Pre-emphasis → Frame Blocking → Windowing → DFT (Spectrum) → Mel-frequency Wrapping (Mel Spectrum) → Cepstrum (Mel Cepstrum)]
As shown in the MFCC block diagram in Figure 2.1, the audio signal from the Indian musical instrument first passes through a process called pre-emphasis. The function of the pre-emphasis process is to boost the energy of the signal and amplify the importance of the high-frequency formants (Chougule & Chavan, 2014; Le-Qing, 2011; Zhu & Alwan, 2003). Formants are the areas in a spectrogram, displayed as dark bands, that show the presence of noise at the highest amplitude; the darker the formants in the spectrogram, the stronger the energy (amplitude) of the audio signal (Prahallad, 2011; Aslam et al., 2010). The function of the spectrogram is to plot the audio signal in amplitude, frequency or time, as shown in Figure 2.2.
Figure 2.2: Spectrogram diagram (McMurray, 2004)
The equation for pre-emphasis is shown in Equation (2.1):

H(z) = 1 - a z^(-1)    (2.1)

where z is the transform-domain variable of the input signal and a is the pre-emphasis coefficient, whose value usually lies between 0.9 and 1.0 (the standard value is 0.97).
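In the time domain, Equation (2.1) corresponds to y[n] = x[n] - a·x[n-1]. The following minimal Python/NumPy sketch (an illustration of the generic filter, not the thesis's own modified pre-emphasis algorithm) applies it with the standard coefficient a = 0.97:

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    # H(z) = 1 - a*z^-1 of Equation (2.1): y[n] = x[n] - a*x[n-1].
    # Boosts the high-frequency content before frame blocking.
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```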
Then, the signal is passed to frame blocking, where it is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples; the second frame begins M samples after the first frame and overlaps it by N - M samples, and so on (Gadade, Jadhav & Deogirkar, 2010). The standard parameter values are M = 100 and N = 256. The next step is windowing, where each individual frame is windowed using a Hamming window in order to minimize the signal discontinuities at the beginning and end of each frame by tapering the signal toward zero (Dave, 2013). The Hamming window w(n) is computed according to Equation (2.2):

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1    (2.2)

where N is the total number of samples in the frame and n is the current sample.
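A minimal sketch of frame blocking with the standard parameters N = 256 and M = 100, followed by the Hamming window of Equation (2.2), might look as follows (assuming the signal is at least N samples long):

```python
import numpy as np

def frame_and_window(signal, N=256, M=100):
    # Block the signal into overlapping frames: frame i starts at i*M
    # and spans N samples, so adjacent frames overlap by N - M samples.
    num_frames = 1 + (len(signal) - N) // M
    frames = np.stack([signal[i * M:i * M + N] for i in range(num_frames)])
    # Hamming window of Equation (2.2): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    return frames * window
```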
Then, the signal is processed using the Discrete Fourier Transform (DFT). The DFT converts the sampled function from its original domain (often time, or position along a line) to the frequency domain. Each frame of N samples in the time domain is therefore converted into the frequency domain, defined on the set of N samples {x_n}, as shown in Equation (2.3):

X_k = Σ_{n=0}^{N-1} x_n e^(-j2πkn/N),  k = 0, 1, 2, ..., N-1    (2.3)

where the X_k are complex numbers whose absolute values (frequency magnitudes) are calculated. X_k is interpreted as follows: positive frequencies 0 ≤ f < Fs/2 correspond to values 0 ≤ n ≤ N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N - 1, where Fs is the sampling frequency.
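Equation (2.3) is exactly what NumPy's FFT routines compute; a short sketch that returns the frequency magnitudes |X_k| for the positive-frequency bins of each windowed frame could be:

```python
import numpy as np

def frame_magnitudes(frames):
    # Equation (2.3) via the FFT; rfft keeps only the positive-frequency
    # bins k = 0 .. N/2, whose absolute values are the magnitudes |X_k|.
    return np.abs(np.fft.rfft(frames, axis=1))
```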
The output of the DFT is the spectrum. In the next step, the spectrum is warped through a process named mel-frequency wrapping and expressed on the mel frequency scale. According to Dave (2013), the mel-frequency scale is linear below 1000 Hz and logarithmic above 1000 Hz. The output is the mel spectrum. The approximate formula to compute the mel frequency is shown in Equation (2.4):

m_f = 2595 log10(f/700 + 1)    (2.4)

where f is the frequency measured in Hz.
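Equation (2.4) and its inverse, used for example to place the edges of the triangular mel filters, can be sketched as:

```python
import numpy as np

def hz_to_mel(f):
    # Equation (2.4): m_f = 2595 * log10(f/700 + 1).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of Equation (2.4), for mapping mel filter centers back to Hz.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```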
The cepstrum is the inverse Fourier transform of the logarithm of the estimated spectrum of an audio signal, as shown in Equation (2.5):

Ĉ_n = Σ_{k=1}^{K} (log Ŝ_k) cos[n(k - 0.5)π/K],  n = 0, 1, ..., K-1    (2.5)

where K is the number of mel spectrum coefficients.

The result of Equation (2.5) is expressed in the mel cepstrum and is referred to as the mel-scale cepstral coefficients, or MFCCs, as shown in Equation (2.6):

C_i(n) = Σ_{m=1}^{M} S(m) cos[πn(m - 0.5)/M],  0 ≤ n < M    (2.6)

where n is the index of the MFCC, C_i(n) is the n-th MFCC coefficient of the i-th frame, S(m) is the logarithmic power spectrum of the audio signal in the m-th mel band, and M is the number of triangular filters.
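Equations (2.5)-(2.6) amount to a discrete cosine transform of the log mel spectrum. A minimal sketch, assuming a log mel spectrum vector has already been produced by a bank of M triangular filters (not shown here), could be:

```python
import numpy as np

def mel_cepstrum(log_mel_spectrum, num_coeffs=13):
    # Equation (2.6): C(n) = sum_{m=1..M} S(m) * cos[pi*n*(m - 0.5)/M],
    # i.e. a type-II DCT of the log mel spectrum; keep the first
    # num_coeffs coefficients as the MFCC feature vector for one frame.
    M = len(log_mel_spectrum)
    m = np.arange(1, M + 1)                   # filter index m = 1 .. M
    n = np.arange(num_coeffs).reshape(-1, 1)  # coefficient index n
    basis = np.cos(np.pi * n * (m - 0.5) / M)
    return basis @ np.asarray(log_mel_spectrum, dtype=float)
```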
2.3.2 Linear Predictive Coding (LPC)
Similar to MFCC, Linear Predictive Coding (LPC) is another technique which offers a powerful yet simple method to extract audio information. The LPC algorithm produces a vector of coefficients that represents a smooth spectral envelope of a temporal input signal (Elminir, ElSoud & El-Maged, 2012). The strength of LPC is that it can remove redundancy in the signal. In addition, LPC has the highest rate of audio compression (Sheetal & Raut, 2012) and takes a short training time, even though it takes a long time to extract the features of an audio signal (Wolf & Nadeu, 2008). Nevertheless, using LPC alone for recognition is not very successful because the all-pole assumption for the vocal tract transfer function is not accurate (Wolf & Nadeu, 2008). Moreover, the LPC coder does not work well for low- or high-pitched voice (speech) signals (Kamal, Sarkar & Rahman, 2011). Another important drawback of LPC is that the extracted audio features may lead to low audio classification accuracy if there is noise at the highest amplitude of the audio signal (Chougule & Chavan, 2014; Dave, 2013). This is because the LPC synthesizer process amplifies the noise at the highest amplitude, which may decrease the quality of the extracted features. Figure 2.3 shows the LPC block diagram.
Figure 2.3: LPC Block diagram (Alwan, 2002). [Stages: s(n) → LPC Analyzer → Channel → Decoder → LPC Synthesizer → s_out(n)]
First, the audio signal is sent to the LPC analyzer, which determines the key features of the signal and tries to encode the signal as accurately as possible. The key features, which resemble the loudness of the signal, determine whether the sound is voiced or unvoiced. The signal is then sent to the channel, whose function is to transmit, or serve as the medium for, the transmission. Next, the signal is passed to a decoder. Decoding involves using the parameters acquired during encoding and analysis to build a synthesized version of the original audio signal. From the decoder, the signal is passed to the LPC synthesizer, which acts as a computerized console or module for creating or scaling the audio signal (Alwan, 2002). The LPC synthesizer scales the output signal to match the level of the input signal (McCree, 2008). In other words, if the input signal contains high energy (amplitude), the LPC synthesizer amplifies the output signal to match the input signal. The formula for the LPC synthesizer is shown in Equation (2.7):

s(n) = Σ_{k=1}^{p} α_k s(n-k) + G u(n)    (2.7)

where s(n) are the waveform samples, α_k are the predictor coefficients, and G represents the loudness; it is multiplied with the excitation signal u(n) to obtain the proper loudness intensity in the excitation signal.

The result of Equation (2.7) can be simplified and expressed as the Linear Predictive Coding (LPC) transfer function shown in Equation (2.8) (Aviv & Grichman, 2011):
H(z) = G / (1 + Σ_{k=1}^{p} a_p(k) z^(-k))    (2.8)

where p is the order (number of poles), the gain G is the signal loudness, and {a_p(k)} are the input parameters for the audio files.
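To make the predictor of Equation (2.7) concrete, the following sketch estimates the coefficients α_k by the standard autocorrelation method, solving the normal equations with NumPy. This is a generic LPC analysis under the assumption of a non-silent frame, not the thesis's own LPC implementation:

```python
import numpy as np

def lpc_analysis(frame, p=12):
    # Autocorrelation method for Equation (2.7): find alpha_1..alpha_p
    # by solving R a = r, where R is the Toeplitz matrix built from the
    # autocorrelations r[0..p-1] and r is the vector (r[1], ..., r[p]).
    x = np.asarray(frame, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    alpha = np.linalg.solve(R, r[1:])
    # Gain G of Equation (2.7): RMS of the prediction error
    # e[n] = s[n] - sum_k alpha_k * s[n-k] over the frame.
    pred = np.zeros_like(x)
    for k in range(1, p + 1):
        pred[k:] += alpha[k - 1] * x[:-k]
    G = np.sqrt(np.mean((x - pred) ** 2))
    return alpha, G
```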
2.3.3 Zero-Crossing Rate (ZCR)
Zero-crossing rate (ZCR) is the rate at which the signal changes from positive to negative or vice versa. Simply put, ZCR is a measure of the number of times in a given frame that the amplitude of the audio signal passes through a value of zero. The rate at which zero crossings occur is a simple measure of the frequency content of a signal. Figure 2.4 shows the block diagram of ZCR. First, the parameters of the audio file, such as the frame size, are assigned. Then, the amplitude of the audio signal is processed in the energy calculation step, and the output is the ZCR shown in Equation (2.9):

Figure 2.4: ZCR Block diagram (Raju et al., 2013). [Stages: Input Signal → Assign Value → Calculate Energy → Output Signal]

ZCR = (1/(T-1)) Σ_{t=1}^{T-1} 1{s_t s_{t-1} < 0}    (2.9)

where s is a signal of length T and the indicator function 1{A} is equal to 1 when A is true and 0 otherwise.
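Equation (2.9) reduces to counting sign changes between consecutive samples; a minimal sketch:

```python
import numpy as np

def zero_crossing_rate(frame):
    # Equation (2.9): fraction of the T-1 consecutive sample pairs whose
    # product is negative, i.e. where the waveform crosses zero.
    s = np.asarray(frame, dtype=float)
    return np.mean(s[1:] * s[:-1] < 0)
```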
In order to use ZCR to distinguish unvoiced sounds from noise and the environment, the waveform can be shifted before computing the ZCR. The ZCR is much lower for voiced sound than for unvoiced sound (Ngo, 2011; Patil et al., 2012). ZCR is an important parameter for voiced/unvoiced classification (Khan, Bhaiya, & Banchhor, 2012). Unfortunately, the drawback of ZCR is that it is unable to extract the best quality audio features, because the energy calculation process amplifies the noise at the highest amplitude of the audio signal (Bormane & Dusane, 2013). Therefore, ZCR leads to degradation of audio classification accuracy (Bormane & Dusane, 2013).
Nevertheless, improvements can be made to MFCC, LPC and ZCR to increase their performance in audio feature extraction, thereby contributing to high-quality audio features and better audio classification accuracy. This inspired the integration of an equalizer with the audio feature extraction techniques to equalize the noise at the highest amplitude of the audio signal. The next section provides a detailed explanation of noise in audio signals.
2.4 Noise in Audio Signal
Noise can be defined as an uncontrolled, loud, unmusical or unwanted component of an audio signal (Gold, Morgan & Ellis, 2011; Weeks, 2010; Thorne, 2007). There are many types of noise, such as voiced noise, unvoiced noise and background noise (Gold, Morgan & Ellis, 2011). Voiced noise is sound produced by the human voice, such as breathing or coughing, whereas unvoiced noise is sound produced by a musical instrument or a sound effect, such as door knocks, paper shuffling or the plucking of a string instrument (Gold, Morgan & Ellis, 2011; Gunasekaran & Revathy, 2008). Background noise is any sound other than the sound being monitored, coming from the surroundings or from external sources such as an ambulance passing outside or people talking (Gold, Morgan & Ellis, 2011). In this research, the noise has been identified as coming from the musical instrument itself, due to the vibration of the strings, which produces noise at the highest amplitude (Dave, 2013; Stulov & Kartofelev, 2014; Subramanian, 2006). Some types of noise, such as background noise, can be reduced or removed using noise filtering; other types, such as high-amplitude noise, can only be mitigated using a sound equalizer (Kaur & Kansal, 2013; Gold, Morgan & Ellis, 2011; Weeks, 2010). The next sections describe noise filtering and sound equalizers in more detail.
2.5 Noise Filtering
Noise filtering, also known as noise reduction, is the process of removing unwanted components (noise) from an audio signal (Tan & Jiang, 2013). Noise filtering is normally applied to audio signals in applications such as noise reduction, sound crossover and others (Tan & Jiang, 2013; Altera, 2010). Basically, noise filtering involves removing different types of noise, such as background noise or unvoiced noise, from the audio signal (Gold, Morgan & Ellis, 2011; Weeks, 2010). Moreover, noise filtering is used to filter a noisy signal, for example cleaning up an audio signal recorded in a room full of other conversations (Tan & Jiang, 2013). Nevertheless, noise filtering does not focus on the specific high-amplitude noise while filtering background noise from the audio signal (Tan & Jiang, 2013; Weeks, 2010). Therefore, as an alternative that addresses this drawback of noise filtering, this study suggests the use of a sound equalizer, which has the potential to equalize the noise that comes from the high amplitude of an audio signal (Kaur & Kansal, 2013; Kumar & Kaur, 2012; Altera, 2010). The next section provides a detailed explanation of the sound equalizer.
on the sound equalizer.
2.6 Sound Equalizer
A sound equalizer is an essential part of any sound system that provides an approximate inverse of the channel frequency response (Kumar & Kaur, 2012). Equalizers are used in recording studios, broadcast studios and live sound reinforcement to eliminate unwanted sounds, such as noise from microphones, instrument pick-ups, loudspeakers and hall acoustics. Figure 2.5 shows an audio equalizer in a communication system. In general, the audio captured from the audio input source is sent to an equalizer over a communication channel (physical medium) such as wires, radio, acoustic media, or magnetic or optical recording media. The function of the equalizer is to equalize the amplitude of the audio signal and transmit it to a digital-to-analog converter (DAC). The DAC converts the digital audio signal to analog sound and outputs it to a speaker or headphone.
Figure 2.5: Audio equalizer (Proakis, 2008). [Stages: Audio capture → Communication channel → Equalizer → Write to DAC → Audio Playback]
Specifically, an equalizer performs equalization to mitigate the effects of inter-symbol interference (ISI). ISI is a form of distortion of the audio signal due to an unwanted sound or effect, such as noise, produced at high amplitude (Kaur & Kansal, 2013). Kumar and Kaur (2012) also stated that the reduction of ISI effects has to be stabilized when the audio signal contains noise. According to Deepa (2013), ISI arises because of the spreading of a transmitted pulse due to the dispersive (widely spread or scattered) nature of the channel, which results in the overlapping of adjacent pulses. An equalizer is usually implemented at the original frequency or amplitude range of the signal (National Chung Cheng University, 2013). The equalization equation is shown in Equation (2.10):

y(t) = x(t) * f*(t) + n_b(t)    (2.10)

where x(t) is the original input signal, f*(t) is the complex conjugate of f(t) (changing the sign of the imaginary part while leaving the real part unchanged), and n_b(t) is the noise at the input of the equalizer.
An equalizer responds to the impulse response of the signal. The impulse response refers to the reaction of the signal to some external change (Golden, Dedieu & Jacobsen, 2005). A high wave in the signal refers to the highest amplitude of the wave in the channel. The simplest way to obtain the channel impulse response is to send a single pulse (approximately an impulse) over the channel and observe the resulting received signal. If there is no noise on the channel, this gives a good estimate of the channel impulse response. In reality, however, an estimate of the channel impulse response that is based on a single transmitted impulse will always be noisy. Therefore, the effect of the noise can be reduced by taking many measurements of the impulse response (Golden, Dedieu & Jacobsen, 2005).
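As a small illustration of this averaging idea (the measure_once function below is hypothetical and stands in for one noisy pulse measurement), averaging n independent measurements reduces the noise variance by a factor of n:

```python
import numpy as np

def estimate_impulse_response(measure_once, n_trials=100):
    # measure_once() is assumed to return the received response to one
    # transmitted unit impulse; averaging n_trials independent noisy
    # measurements cuts the noise variance by a factor of n_trials.
    return np.mean([measure_once() for _ in range(n_trials)], axis=0)
```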
There are two general categories of equalizer: linear and nonlinear. Altera (2010) states that a linear equalizer is frequency dependent and can be highly effective in mitigating ISI. In summary, a linear equalizer mitigates the ISI of a single audio signal without enhancing the noise. Examples of linear equalizers are the Zero-Forcing Equalizer (ZFE) and the Minimum Mean Square Error (MMSE) equalizer. On the other hand, a nonlinear equalizer involves the use of mixed signals (more than one audio signal in the waveform). MIT Lincoln Laboratory (2009) states that nonlinear equalizers are used to detect small signals in the presence of strong backgrounds, such as in radar, signal intelligence and electronic intelligence systems. According to National Chung Cheng University (2013), examples of nonlinear equalizers are Maximum Likelihood Symbol Detection (MLSD) and the Maximum Likelihood Sequence Estimator (MLSE). This study focuses only on the linear equalizers ZFE and MMSE, since linear equalizers are highly effective in mitigating the unwanted sound (noise) at the highest amplitude of the audio signal (Kaur & Kansal, 2013; Kumar & Kaur, 2012; Altera, 2010). The next subsections discuss ZFE and MMSE in detail.
2.6.1 Zero Forcing Equalizer (ZFE)
The Zero Forcing Equalizer is a linear receiver used in communication systems. This equalizer inverts the frequency response of the channel. Kumar and Kaur (2012) state that for a channel with frequency response F(f), the zero forcing equalizer C(f) is constructed as shown in Equation (2.11):

C(f) = 1 / F(f)    (2.11)

The output of the ZFE is calculated from the numerator and denominator coefficients, which are obtained from the inverse of the impulse response in the frequency domain, F(f). Thus, the combination of channel and equalizer gives a flat frequency response and linear phase, as shown in Equation (2.12):

F(f)C(f) = 1    (2.12)
The implementation of ZFE depends on the channel in which it is used (Proakis, 2008). A channel is used to convey an information signal, for example a digital bit stream, from one or several senders (or transmitters) to one or several receivers. In information theory, a storage device is also a kind of channel, which can be sent to (written) and received from (read). A typical channel is modeled in the discrete domain as shown in Equation (2.13) (Dytso, 2012):

y[n] = h[n] * x[n] + z[n]    (2.13)

where y[n] is the channel output, h[n] is the channel impulse response, x[n] is the channel input, and z[n] is the noise.

From Equation (2.13), the channel is converted to the frequency domain as shown in Equation (2.14):

y(f) = h(f)x(f) + z(f)    (2.14)

The ZFE multiplies y(f) by inv(h(f)) to reduce the ISI, as shown in Equation (2.15) (Kaur & Kansal, 2013):

inv(h(f)) y(f) = x(f) + inv(h(f)) z(f)    (2.15)
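A minimal frequency-domain sketch of Equations (2.13)-(2.15) follows. It assumes the channel acts as a (circular) convolution over the block and guards the division against near-zero bins, which is exactly where the noise amplification of zero forcing appears:

```python
import numpy as np

def zero_forcing_equalize(received, channel_ir, eps=1e-12):
    # Equation (2.14): Y(f) = H(f)X(f) + Z(f) via the FFT.
    n = len(received)
    Y = np.fft.fft(received)
    H = np.fft.fft(channel_ir, n=n)  # zero-pad impulse response to length n
    # Equation (2.15): divide by H(f), recovering X(f) + Z(f)/H(f).
    # Where |H(f)| is small, the Z(f)/H(f) term grows large; this is the
    # attenuation drawback discussed below.
    X_hat = Y / np.where(np.abs(H) < eps, eps, H)
    return np.real(np.fft.ifft(X_hat))
```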
In previous research, ZFE has been used to solve the problem of signal transmission in Multiple Input Multiple Output (MIMO) systems (Khademi et al., 2013). MIMO is a wireless technology that uses multiple transmitters and receivers to improve communication performance; ZFE is used there to mitigate the interference in the transmitted signals. Moreover, ZFE is particularly useful for equalizing the effect of noise at the higher amplitudes introduced by ISI (Kaur & Kansal, 2013; Mobile Communication, 2009). However, one drawback of ZFE is that the channel response may exhibit attenuation (reduction of signal strength during transmission) at high frequencies around one-half the sampling rate (the folding frequency) (Kumar & Kaur, 2012). Another drawback is that using the equalizer standalone or independently decreases the performance of the channel; therefore, much depends on how ZFE is used, as mentioned by Dytso (2012).
As mentioned previously, due to the drawback of the audio feature extraction techniques in extracting noisy signals, ZFE is proposed to be integrated with them, since a characteristic of ZFE is that it mitigates the effect of noise at the highest amplitude of the audio signal. Hence, ZFE should be able to improve the performance of the audio feature extraction techniques. Another equalizer, MMSE, which is in the same category as ZFE, is discussed in the next section.
2.6.2 Minimum Mean Square Error (MMSE) Equalizer
The Minimum Mean Square Error (MMSE) equalizer minimizes the mean square error (MSE). MSE is a common measure of estimator quality, as stated by Kumar and Kaur (2012). The main property of the MMSE equalizer is that it does not usually eliminate ISI completely, but it minimizes the total power of the noise in the output. If x is an unknown random variable, then an estimator x̂ of x is any function of the measurements of a known random variable, and its MSE is given by the simplified Equation (2.16):

MSE = E{(x̂ - x)²}    (2.16)

where x is a scalar variable.
According to Cioffi (2008), MMSE can provide better performance if the audio signal is voice (speech). However, the author points out that a major drawback of MMSE is that the equalizer is slightly more complicated to describe and analyze than the ZFE. Also, because of biasing (an external force that controls the equalizer), the MMSE output is slightly lower than the ZFE output. MMSE does not assume any stochastic mechanism (involving random variables) for the desired and observed signals (Chen et al., 2013); it only makes assumptions about the noise, for example that the noise is additive, zero-mean, time-independent, bounded (limited) and of known variance. It also does not usually reduce the ISI effect. Due to these disadvantages of MMSE, this research focuses on ZFE. However, this research also integrates MMSE with the audio feature extraction techniques in order to compare the performance of both equalizers when integrated with the respective audio feature extraction techniques.
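For comparison with the zero-forcing sketch in Section 2.6.1, a textbook frequency-domain form of the MMSE equalizer (a minimal sketch assuming known noise and signal power, not the thesis's own implementation) replaces the hard channel inversion with a noise-aware weighting:

```python
import numpy as np

def mmse_equalize(received, channel_ir, noise_var, signal_var=1.0):
    # Per-bin MMSE weighting C(f) = conj(H(f)) / (|H(f)|^2 + noise/signal):
    # it tolerates a little residual ISI in exchange for not amplifying
    # the noise where the channel response H(f) is weak.
    n = len(received)
    Y = np.fft.fft(received)
    H = np.fft.fft(channel_ir, n=n)
    C = np.conj(H) / (np.abs(H) ** 2 + noise_var / signal_var)
    return np.real(np.fft.ifft(C * Y))
```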
REFERENCES
Agarwal, P., Karnick, H., & Raj, B. (2013). A Comparative Study Of Indian And Western Music
Forms. In ISMIR. Brazil: Pontifical Catholic University of Paraná. pp. 29-34.
Altera. (2010). Understanding the Pre-Emphasis and Linear Equalization Features in Stratix
IV GX Devices. Retrieved on January 29, 2014, from www.altera.com
Alwan, A. (2002). Wideband Speech Coding with Linear Predictive Coding (LPC). Retrieved
September 13, 2013, from http://www.seas.ucla.edu
Anusuya, M.A. & Katti, S.K. (2011). Comparison of Different Speech Feature Extraction
Techniques with and without Wavelet Transform to Kannada Speech Recognition,
International Journal of Computer Applications, 26(4), pp. 19-24.
Aslam, T., Doherty, J., Dubrau, A., & Hendren, L. (2010). AspectMatlab: an aspect-oriented
scientific programming language. In Proceedings of the 9th International Conference
on Aspect-Oriented Software Development. France: ACM New York. pp. 181-192.
Atlas, L. (2011). Spectrogram. Retrieved on March 18, 2015, from http://isdl.ee.washington.edu
Aviv, A., Grichman, K. (2011). Long-term prediction. Retrieved on August 30, 2013, from
http://health.tau.ac.il
Bello, J. P. (2013). Sound Classification. Retrieved on May 2, 2014, from http://www.nyu.edu
Ben Gal, I. (2007). Bayesian Networks, in Encyclopedia of Statistics in Quality and Reliability,
Ruggeri, F., Kenett, R. S. and Faltin, F. (editors in chief). USA: John Wiley & Sons Ltd.
Bormane, D. S., & Dusane, M. (2013). A Novel Techniques for Classification of Musical
Instruments. In Information and Knowledge Management, 3(10), pp. 1-8.
Buckingham Music Inc. (2008). Veenas. Retrieved on February 4, 2013, from
http://www.buckinghammusic.com/veena/veena.html
Bullock, J. (2008). Implementing audio feature extraction in live electronic music. University
of Birmingham: Ph.D. Thesis.
Chachada, S., & Kuo, C. C. J. (2014). Environmental sound recognition: A survey. APSIPA
Transactions on Signal and Information Processing, 3(14), pp. 1-15.
Chen, J., Ma, T., Chen, W., & Peng, Z. (2013). Unsupervised robust recursive least-squares
algorithm for impulsive noise filtering. Journal Science China Information Sciences,
56(4), pp. 1-10.
Cioffi, M.J. (2008). Chapter 3. Equalization. Retrieved on January 30, 2014, from
http://www.stanford.edu/group/cioffi/book/
Chougule, S. V., & Chavan, M. S. (2014). Channel Robust MFCCs for Continuous Speech
Speaker Recognition. Advances in Signal Processing and Intelligent Recognition
Systems, pp. 557-568.
Chu, S., Narayanan, S., & Kuo, C. C. (2009). Environmental sound recognition with time–
frequency audio features. IEEE Transactions on Audio, Speech, and Language
Processing, 17(6), pp. 1142-1158.
Courtney, D. (2012). Gottuvadhyam. Retrieved on February 3, 2013, from
http://chandrakantha.com
Dave, N. (2013). Feature Extraction Methods LPC, PLP and MFCC in Speech Recognition.
International Journal of Research in Engineering and Advanced Technology, 1(6),
pp. 1-5.
Deepa,T. (2013). Linear Equalizers & Nonlinear Equalizers. Retrieved on January 28, 2014,
from http://www.srmuniv.ac.in
Dytso, A. (2012). Zero Forcing Equalizer. Retrieved on January 29, 2014, from
http://www.mathworks.com
Elminir, H. K., ElSoud, M. A., & El-Maged, L. A. (2012). Evaluation of Different Feature
Extraction Techniques for Continuous Speech Recognition. International Journal
of Science and Technology, 2(10), pp. 689-695.
Eronen, A. (2001). Automatic musical instrument recognition. Tampere University of
Technology: Master's Thesis.
Furht, B. (2009). Handbook of multimedia for digital entertainment and arts. USA: Springer.
Gadade, M. H., Jadhav, M. M. R., & Deogirkar, M. S. V. (2010). Speech Identification and
Recognition Using Data Mining. Government College of Engineering: Technical
Report.
Gold, B., Morgan, N., & Ellis, D. (2011). Speech and audio signal processing: processing and
perception of speech and music. USA: John Wiley & Sons.
Golden, P., Dedieu, H., & Jacobsen, K. S. (2005). Fundamentals of DSL Technology. USA:
Auerbach.
Gunasekaran, S., & Revathy, K. (2008). Fractal dimension analysis of audio signals for Indian
musical instrument recognition. International Conference on Audio, Language and
Image Processing (ICALIP). Shanghai: IEEE. pp. 257-261.
Gupta, S., Jaafar, J., Ahmad, W. F., & Bansal, A. (2013). Feature Extraction Using MFCC.
Signal & Image Processing: An International Journal (SIPIJ), 4(4), pp. 101-108.
Jupiter Infomedia Ltd. (2012). Carnatic music. Retrieved on December 24, 2012, from
http://www.indianetzone.com/2/carnatic_music.htm
Jupiter Infomedia Ltd. (2011). Veena. Retrieved on December 24, 2012, from
http://www.indianetzone.com/16/veena.htm
Kamal, S. N., Sarkar, R. N., & Rahman, M. (2011). Comparison of Musical Pitch Analysis
Between LPC and CELP. International Journal of Advances in Engineering Sciences,
1(1), pp. 35-39.
Kaur, N., & Kansal, L. (2013). Performance Comparison of MIMO Systems over AWGN and
Rician Channels with Zero Forcing Receivers. International Journal of Wireless &
Mobile Networks, 5(1), pp. 73-84.
Keronen, S., Pohjalainen, J., Alku, P., & Kurimo, M. (2011). Noise robust feature extraction
based on extended weighted linear prediction in LVCSR. In Proceedings of
INTERSPEECH. Italy: ISCA. pp. 1265-1268.
Khademi, S., Chepuri, S. P., Leus, G., & van der Veen, A. J. (2013). Zero-forcing pre-
equalization with transmit antenna selection in MIMO systems. International Conference
on Acoustics, Speech and Signal Processing (ICASSP). Vancouver: IEEE. pp. 5046-5050.
Krishnaraj, S. (2010). Carnatic music. Retrieved on January 7, 2013, from
http://www.carnaticindia.com/
Kumar, M., & Kaur, J. (2012). Performance analysis of BPSK system with ZF & MMSE
equalization. International Journal of Latest Trends in Engineering and Technology
(IJLTET), 1(3), pp. 127-133.
Kumar, V., Pandya, H., & Jawahar, C. V. (2014). Identifying ragas in Indian music. 22nd
International Conference on Pattern Recognition (ICPR). Sweden: IEEE. pp. 767-772.
Kumari, M., Kumar, P., & Solanki, S. S. (2010). Classification of North Indian Musical
Instruments using Spectral Features. Journal of Computer Science and
Telecommunications (GESJ), 6(29), pp. 11-24.
Le-Qing, Z. (2011). Insect sound recognition based on MFCC and PNN. International
Conference on Multimedia and Signal Processing (CMSP). China: IEEE. pp. 42-46.
Lerch, A. (2012). An introduction to audio content analysis: Applications in signal processing
and music informatics. USA: John Wiley & Sons.
Li, T., Ogihara, M., & Li, Q. (2003). A comparative study on content-based music genre
classification. In Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval. Canada: ACM New York. pp. 282-
289.
McCree, A. (2008). Springer Handbook of Speech Processing. India: CBS. pp. 97-105.
McMurray, B. (2004). Experimental Psychology II. Retrieved on August 20, 2014, from
http://www2.psychology.uiowa.edu/
MIT Lincoln Laboratory. (2009). Nonlinear Equalization for Receiver Dynamic Range
Extension. Massachusetts Institute of Technology: Technical Report.
Mobile Communication (2009). Equalization, Diversity and Coding Techniques. Retrieved on
August 30, 2013, from http://mc.lctu.cn
Nasridinov, A., & Park, Y. H. (2014). A Study on Music Genre Recognition and Classification
Techniques. International Journal of Multimedia and Ubiquitous Engineering, 9(4), pp.
31-42.
National Chung Cheng University. (2013). Equalization, Diversity, and Channel Coding.
Retrieved on January 28, 2014, from web.ee.ccu.edu.tw
Nettleton, D. F., Orriols-Puig, A., & Fornells, A. (2010). A study of the effect of different types
of noise on the precision of supervised learning techniques. Artificial Intelligence
Review, 33(4), pp. 275-306.
Ngo, K. (2011). Digital signal processing algorithms for noise reduction, dynamic range
compression, and feedback cancellation in hearing aids. University of Leuven: Ph.D.
Thesis.
Panda, A. K., & Sahoo, A. K. (2011). Study of Speaker Recognition Systems. Georgia Institute of
Technology: Ph.D. Thesis.
Patil, H. A., Madhavi, M. C., Jain, R., & Jain, A. K. (2012). Combining Evidence from Temporal
and Spectral Features for Person Recognition Using Humming. Proceedings of the First
Indo-Japan Conference on Perception and Machine Intelligence. India: Springer. pp. 321-328.
Peng, L., Yang, D., & Chen, X. (2014). Multi frame size feature extraction for acoustic event
detection. In Asia-Pacific Signal and Information Processing Association, 2014 Annual
Summit and Conference (APSIPA). Cambodia: IEEE. pp. 1-4.
Prahallad, K. (2011). Speech Technology. Retrieved on April 27, 2014, from
http://www.docstoc.com/docs/109146955/Topic-Spectrogram_Cepstrum-and-Mel-
Frequency-Analysis
Proakis, J., & Salehi, M. (2008). Digital Communications. 5th edition. New York: McGraw-Hill.
Rahman, R. M., & Afroz, F. (2013). Comparison of Various Classification Techniques Using
Different Data Mining Tools for Diabetes Diagnosis. Journal of Software Engineering
and Applications, 6(3), pp. 85-97.
Raju, N., Arjun, N., Manoj, S., Kabilan, K., & Shivaprakaash, K. (2013). Obedient Robot with
Tamil Mother Tongue. Journal of Artificial Intelligence, 6(2), pp. 161-167.
Rao, B. T., Chinnam, S., Kanth, P. L., & Gargi, M. (2012). Automatic Melakarta Raaga
Identification System: Carnatic Music. International Journal of Advanced Research in
Artificial Intelligence (IJARAI), 1(4), pp. 43-48.
Rao, V., Gupta, C., & Rao, P. (2013). Context-aware features for singing voice detection in
polyphonic music. In Adaptive Multimedia Retrieval. Large-Scale Multimedia Retrieval
and Evaluation. Spain: Springer. pp. 43-57.
Rocha, B., Panda, R., & Paiva, R. P. (2013). Music Emotion Recognition: The Importance of
Melodic Features. 6th International Workshop on Machine Learning and Music. Czech
Republic: Springer. pp. 1-4.
Sathish, V. M. (2010). South Indian star turns musician for Sharjah show. Retrieved on
February 3, 2013, from http://www.emirates247.com
Sheetal, G. D., & Raut, R. D. (2012). Advance Source Coding Techniques for Audio/Speech
Signal: A Survey. International Journal of Computer Technology & Applications, 3(4),
pp. 1335-1342.
Smith, S. (2013). Digital Signal Processing: A Practical Guide for Engineers and Scientists.
USA: Newnes.
SoundCloud (2007). Retrieved on January 7, 2013, from https://soundcloud.com/
Stulov, A., & Kartofelev, D. (2014). Vibration of strings with nonlinear supports. Journal of
Applied Acoustics, 76(14), pp. 223-229.
Subramanian, M. (2006). Carnatic Music and the Computer. Retrieved on March 12, 2014, from
http://www.musicresearch.in
Sukor, A. S. (2012). Speaker identification using MFCC procedure and noise reduction method.
Universiti Tun Hussein Onn Malaysia: Master’s Project Report.
Tan, L., & Jiang, J. (2013). Digital signal processing: fundamentals and applications. USA:
Academic Press.
The Incredible India Travel (2011). Musical Instruments. Retrieved on January 5, 2013,
from http://www.theincredibleindiatravel.com
Thorne, R. (2007). Assessing intrusive noise and low amplitude sound. Massey University: Ph.D.
Thesis.
Umapathy, K., Ghoraani, B., & Krishnan, S. (2010). Audio signal processing using time-
frequency approaches: coding, classification, fingerprinting, and watermarking.
EURASIP Journal on Advances in Signal Processing, 10(1), pp. 1-28.
Umapathy, K., Krishnan, S. & Rao, R. K. (2007). Audio Signal Feature Extraction and
Classification Using Local Discriminant Bases. IEEE Transactions on Audio, Speech &
Language Processing, 15(4), pp. 1236-1246.
Walter, S., Scherer, S., Schels, M., Glodek, M., Hrabal, D., Schmidt, M., & Schwenker, F.
(2011). Multimodal emotion classification in naturalistic user behavior. In Human-
Computer Interaction. Towards Mobile and Intelligent Interaction Environments.
Germany: Springer. pp. 603-611.
Weeks, M. (2010). Digital Signal Processing Using MATLAB and Wavelets. 2nd edition. USA:
Jones and Bartlett.
WEKA (2010). Retrieved on February 3, 2013, from http://www.cs.waikato.ac.nz/ml/weka/
Witten, I., Frank, E., & Hall, M. (2011). Data Mining: Practical Machine Learning Tools and
Techniques. 3rd edition. USA: Morgan Kaufmann.
Wolf, M., & Nadeu, C. (2008). Evaluation of different feature extraction methods for speech
recognition in car environment. 15th International Conference on Systems, Signals and
Image Processing (IWSSIP). Bratislava: IEEE. pp. 359-362.
Xie, C., Cao, X., & He, L. (2012). Algorithm of Abnormal Audio Recognition Based on
Improved MFCC. International Workshop on Information and Electronics Engineering
(IWIEE). China: Elsevier. pp. 731-737.
Yu, Y., Zimmermann, R., Wang, Y., & Oria, V. (2013). Scalable Content-Based Music Retrieval
Using Chord Progression Histogram and Tree-Structured LSH. IEEE Transactions on
Multimedia, 15(8), pp. 1969-1981.
Zhu, Q., & Alwan, A. (2003). Non-linear feature extraction for robust speech recognition in
stationary and non-stationary noise. Computer Speech & Language, 17(4), pp.
381-402.
Zytrax (2014). Digital Sound Primer. Retrieved on August 27, 2014, from
http://www.zytrax.com