FILLED PAUSE CLASSIFICATION USING ENERGY-BOOSTED MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)

Raseeda Hamzah, Nursuriati Jamil, Noraini Seman

Faculty of Computer and Mathematical Sciences,

MARA University of Technology, Shah Alam, 40450 Selangor,

Malaysia.


Abstract. Filled pause is a type of disfluency, identified as the most frequently occurring disfluency in spontaneous speech and known to affect Automatic Speech Recognition accuracy. The purpose of this study is to analyze the impact of boosting Mel-Frequency Cepstral Coefficients with an energy feature in classifying filled pauses. A total of 828 filled pauses from a mixture of 62 male and female speakers are classified into /mhm/, /aaa/ and /eer/. A back-propagation neural network that combines gradient descent with momentum and an adaptive learning rate is used as the classifier. The results show that energy-boosted Mel-Frequency Cepstral Coefficients produce a higher accuracy rate of 77% in classifying filled pauses.

Keywords: Malay Filled Pause, Energy, Mel-Frequency Cepstral Coefficients,

Energy-Boosted MFCC, Artificial Neural Network, Gradient Descent

Momentum

1 Introduction

An Automatic Speech Recognition (ASR) system mainly processes read speech or spontaneous speech and converts it into text. ASR for read speech is considerably less complicated than for spontaneous speech. One of the many hurdles in processing spontaneous speech is the occurrence of disfluencies such as filled pauses, repetitions and sentence restarts, with filled pauses being the most frequent. In this paper, spontaneous speech is defined as unprepared speech, in contrast to read speech, where utterances contain formal, well-formed sentences close to those found in written documents. Many studies have focused on the detection and correction of these disfluencies. Our research focuses on the classification of filled pauses, as they are the most common type of disfluency in spontaneous speech.

The Mel-Frequency Cepstral Coefficients (MFCC) feature is one of the best known and most commonly used features for speech recognition (Mahesha and Vinod, 2012). It is a type of speech feature consisting of coefficients that represent audio and are derived from a cepstral representation of the audio clip (Rosdi, 2008). MFCC has been used successfully in recent speech processing work such as non-speech detection in dysarthric speech (Hu, 2009), isolated spoken speech recognition (Seman, 2012) and continuous speech recognition (Horia, 2011). Prosodic features are also widely used in acoustic-based filled pause detection. Energy, pitch and formant frequency are three prosodic features used in filled pause detection by Garg and Ward (2006), Kaushik et al. (2010) and Veiga (2011).

In this research, energy is chosen as the acoustical feature to be integrated with Mel-Frequency Cepstral Coefficients (MFCC). Energy is chosen rather than pitch and formant frequency because pitch has not been shown to improve performance (Stouten et al., 2006). Formant frequency is reported to be better than pitch (Audhkhasi, 2009). However, formant frequency is computationally expensive as it involves complicated calculations. Energy is a very good acoustical representation of noisy data, as the energy of speech is higher than the energy of background noise (Sakhnov et al., 2009).

Several well-known modeling methods such as the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), Support Vector Machine (SVM) and Artificial Neural Network (ANN) have been used as classifiers in ASR. Among these classifiers, ANN is the most efficient in speech recognition, as stated in previous work by Majeed (2012) and Dede and Sazli (2010). ANN also has the ability to generalize after learning from the samples given to the network. This generalization ability allows ANN to model the unseen part of the population correctly even if the sample data contain noisy information (Zhang et al., 1998). The raw data used in this research are gathered from the Malaysia Parliamentary Hansard Document (MPHD) debate sessions of the year 2008. Since the spontaneous speech data are recorded live, they contain background noise, interruptions and various speaking styles (low, medium and high intonation). Therefore, ANN is preferred as the classifier model in this research.

1.1 Data Preparation and Methodology

The Malaysia Parliamentary Hansard Document (MPHD) spoken data used in this research consist of 51 video files (.avi) and 42 text files (.pdf). Each session of the debate contains six to eight hours of speech (Seman et al., 2010). It comprises 253 topics with 222 selected male and female speakers. However, for the purpose of this research, a selection of 10 topics with 35 speakers is used. The resulting data collection contains 828 filled pauses. The Malay filled pauses analyzed in this research are defined as /MHM/, /AAA/ and /EER/. The filled pause collection contains 82 /MHM/, 484 /EER/ and 272 /AAA/.

For the initial stage of data collection, 562 sentences were extracted from the MPHD audio. The filled pause in each sentence is manually segmented, grouped by type (/EER/, /AAA/ and /MHM/) and labeled as Manual Datasets (MDs).

1.2 Pre-processing

Pre-processing is a crucial step in signal processing, especially in automatic speech recognition. Without proper pre-processing of the recorded speech input, the classification performance or recognition rate will decrease. In general, the pre-processing operations involved in this research are filtering, pre-emphasis, frame blocking, windowing and feature extraction. The input speech waveform is initially sampled at a rate of 16 kHz. It is then filtered using a high-pass filter to suppress noise and accentuate the high-frequency part of the speech waveform.

A pre-emphasis process is then applied to the input speech waveform. The pre-emphasis is defined by:

$$ y[n] = s[n] - P \cdot s[n-1] \qquad (1) $$

where s[n] is the n-th speech sample, y[n] is the corresponding pre-emphasized sample and P is the pre-emphasis factor, typically having a value between 0.9 and 1. The input speech waveform then passes through the frame blocking and windowing processes.

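The common Hamming window is applied using the formula:

$$ w[n] = \begin{cases} 0.54 - 0.46 \cos\left(\dfrac{2 \pi n}{N - 1}\right), & 0 \le n \le N - 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2) $$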
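The pre-emphasis and windowing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pre-emphasis factor of 0.97, the 25 ms frame length (400 samples at 16 kHz) and the 10 ms shift (160 samples) are assumed values that the paper does not state, and the function names are hypothetical.

```python
import numpy as np

def pre_emphasize(s, p=0.97):
    """Apply y[n] = s[n] - p * s[n-1]; the first sample is passed through unchanged."""
    s = np.asarray(s, dtype=float)
    y = np.empty_like(s)
    y[0] = s[0]
    y[1:] = s[1:] - p * s[:-1]
    return y

def hamming(n_samples):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for 0 <= n <= N - 1."""
    n = np.arange(n_samples)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (n_samples - 1))

def frame_and_window(y, frame_len=400, shift=160):
    """Split y into overlapping frames and weight each frame with a Hamming window.

    Assumes len(y) >= frame_len; frame_len and shift are illustrative values.
    """
    n_frames = 1 + (len(y) - frame_len) // shift
    w = hamming(frame_len)
    return np.stack([y[i * shift : i * shift + frame_len] * w for i in range(n_frames)])

# Example: one second of a synthetic 200 Hz tone sampled at 16 kHz
signal = np.sin(2.0 * np.pi * 200.0 * np.arange(16000) / 16000.0)
frames = frame_and_window(pre_emphasize(signal))
```

Each row of `frames` is one windowed frame, ready for the feature extraction steps that follow.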

1.3 Features Extraction

The front-end signal processing used in ASR is feature extraction. Feature extraction involves converting the speech waveform into some type of parametric representation.

1.3.1 Energy

Let x(i) be the i-th sample of a speech signal. If the length of the frame is N samples, then the j-th frame can be represented as in Eq. 3.

$$ f_j = \{ x(i) \}_{i=(j-1)N+1}^{jN} \qquad (3) $$

The most common way of calculating the energy of a speech signal is to calculate its frame-by-frame energy (Sakhnov et al., 2009), as shown in Eq. 4.

$$ E_j = \frac{1}{N} \sum_{i=(j-1)N+1}^{jN} x^2(i) \qquad (4) $$

where E_j is the energy of the j-th frame and f_j is the j-th frame, that is, the frame for which the energy is calculated.
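As an illustration of the frame-by-frame energy, the sketch below computes the mean of the squared samples over consecutive, non-overlapping frames of N samples. The function name is hypothetical, and dropping a trailing partial frame is an assumption; the paper does not say how incomplete frames are handled.

```python
import numpy as np

def frame_energy(x, frame_len):
    """Return E_j = (1/N) * sum of x(i)^2 for each non-overlapping frame of N samples."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len                     # a trailing partial frame is dropped
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

# Three frames of two samples each: [1, -1], [2, 2], [0, 3]
energies = frame_energy([1.0, -1.0, 2.0, 2.0, 0.0, 3.0], frame_len=2)
```

For this toy input the three frame energies are 1.0, 4.0 and 4.5.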

The energy of the 82 /MHM/, 484 /EER/ and 272 /AAA/ filled pauses is calculated, and the average energy of each type of filled pause is shown in Table 1. As can be seen, the energy value of /AAA/ is much lower than those of /EER/ and /MHM/, whereas the energy values of /MHM/ and /EER/ are similar.

Table 1. The average energy of each type of filled pause

Type of filled pause    Energy
/AAA/                   0.69
/EER/                   10.15
/MHM/                   11.66

1.3.2 Mel-Frequency Cepstral Coefficient Extraction

The process flow of MFCC extraction from the input speech is shown in Fig. 1. To compute the MFCC, a 512-point Discrete Fourier Transform (DFT) is performed on each windowed frame of the speech waveform. The DFT is used to calculate the short-term power spectrum of the speech signal (Sahidullah, 2012). The Mel scale is then calculated using Eq. 5.

$$ mel(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \qquad (5) $$

The derived log amplitudes of the spectrum are mapped onto the Mel scale using triangular overlapping windows. After some analysis, 12 cepstral coefficients are chosen. The Discrete Cosine Transform (DCT) of the list of Mel log amplitudes is then calculated. The MFCCs are the amplitudes of the resulting spectrum.
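The Mel mapping of Eq. 5, read as mel(f) = 2595 log10(1 + f/700), can be sketched in a few lines. The inverse mapping and the helper that places triangular filter centres evenly on the Mel scale are illustrative additions; the function names and the 0 to 8000 Hz range (the Nyquist band at 16 kHz) are assumptions, not values given in the paper.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def filter_centres_hz(n_filters, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of triangular filters spaced evenly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mel_points)[1:-1]   # drop the two band edges

centres = filter_centres_hz(26)          # 26 filters is an illustrative choice
```

Because the spacing is uniform in Mel, the centres are dense at low frequencies and sparse at high frequencies, which is the property the triangular overlapping windows rely on.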

Fig. 1. The MFCC extraction process (flow diagram; block labels: Pre-processing, Discrete Fourier Transform, Mel-frequency wrapping, Logarithm, Discrete Cosine Transform, Mel-Frequency Cepstral Coefficients)

1.3.3 Data Normalization

Data normalization needs to be performed before the features are fed into the Artificial Neural Network model (Seman, 2012). Raw recorded input speech varies in waveform, length and amplitude. In this research, data normalization is done by taking the mean of each MFCC and energy feature extracted from the Manual Datasets filled pauses.
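One way to read this normalization step is that the per-frame features of each filled pause are averaged over time, giving a fixed-length vector regardless of the segment duration. The sketch below follows that reading. Appending the mean energy as a 13th element is an assumption made to match the 12 and 13 feature counts listed in Table 2, and the function name is hypothetical.

```python
import numpy as np

def mean_feature_vector(mfcc_frames, energy_frames=None):
    """Average per-frame features over time to obtain one fixed-length vector.

    mfcc_frames:   array of shape (n_frames, 12), one row of MFCCs per frame
    energy_frames: optional array of shape (n_frames,), one energy value per frame
    """
    mfcc_mean = np.mean(np.asarray(mfcc_frames, dtype=float), axis=0)
    if energy_frames is None:
        return mfcc_mean                               # 12 values (MFCC only)
    energy_mean = np.mean(np.asarray(energy_frames, dtype=float))
    return np.append(mfcc_mean, energy_mean)           # 13 values (energy-boosted)

# Two segments of different length map to vectors of the same size
rng = np.random.default_rng(0)
short_vec = mean_feature_vector(rng.normal(size=(20, 12)), rng.random(20))
long_vec = mean_feature_vector(rng.normal(size=(75, 12)), rng.random(75))
```

Under this reading, segments of 20 and 75 frames both reduce to 13 values, so the network input size does not depend on how long a filled pause lasts.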

1.4 Classification and Evaluation

The Multilayer Perceptron (MLP) is one of the ANN architectures. In this research, MLP is chosen because of its ability to function well for non-linear phenomena (Ayoubi et al., 2011). The basic architecture of the MLP is shown in Fig. 2. Multi-layer neural networks with the backpropagation algorithm are used in this research to classify the three types of filled pause: /AAA/, /EER/ and /MHM/. A three-layer feed-forward neural network with one hidden layer and one output layer is applied in three different structures. The three structures implemented in this research use the same inputs, target assignments, activation functions, output layer structure and network parameters, and differ only in the number of hidden neurons (HN) in the hidden layer.

Fig.2. Architecture of MLP-ANN

The output layer of the three-layer feed-forward network has 3 neurons, one for each type of filled pause. Each filled pause type is assumed to correspond to a class, and each filled pause belonging to its respective class is labeled with an integer from one to three for classification purposes. The structure of the MLP used in this research is summarized in Table 2.

Table 2. The MLP-ANN structure of the research

Hidden neurons   Features: MFCC   Features: Energy-boosted MFCC   Training function
7                12               13                              GDM
12               12               13                              GDM
24               12               13                              GDM

The number of input neurons is calculated by multiplying the cepstral order by the total number of frames, as below:

$$ \text{Total\_Frame\_Number} = \text{Signal\_length} / \text{Shift} - (\text{Frame\_length} / \text{Shift}) \qquad (6) $$

$$ \text{Input\_Neuron\_Number} = \text{Cepstral\_order} \times \text{Total\_Frame\_Number} \qquad (7) $$

[Fig. 2 diagram labels: input layer (X1, X2, X3, ..., X(n)), hidden layer, output layer]

To evaluate the performance of each MLP structure used in this research, different numbers of hidden neurons in the hidden layer were tried through a trial-and-error process. The number of hidden neurons cannot be too large, otherwise the network cannot obtain a good convergence rate (Seman, 2012). Too small a number of hidden neurons will cause a large classification error. One guideline for choosing the number of hidden neurons is the Geometric Pyramid Rule (GPR), which is calculated by the following equation:

$$ HNN = (\text{Input\_number} \times \text{Output\_number})^{1/2} \qquad (8) $$
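As a quick worked example of the Geometric Pyramid Rule, the sketch below evaluates the square root of the product of input and output counts. It assumes, for illustration only, that the input sizes are the 12 and 13 feature counts from Table 2 and that there are 3 outputs; the function name is hypothetical.

```python
import math

def gpr_hidden_neurons(n_inputs, n_outputs):
    """Geometric Pyramid Rule: square root of (inputs * outputs)."""
    return math.sqrt(n_inputs * n_outputs)

hn_mfcc = gpr_hidden_neurons(12, 3)                  # sqrt(36) = 6.0
hn_boosted = math.ceil(gpr_hidden_neurons(13, 3))    # sqrt(39) is about 6.24, rounded up to 7
```

Under these assumptions the rule suggests 6 to 7 hidden neurons, which is close to the smallest hidden-layer size (7) listed in Table 2.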

The training algorithm that has been chosen is Gradient Descent with Momentum

(GDM) which is discussed further in the following subsection.

1.4.1 Training the Artificial Neural Network

Training of MLP-ANN is done using the standard Gradient Descent (GD) algorithm. The training goal is set to 0.001 and, to achieve this goal, up to 1,800 iterations in steps of 300 are run. One disadvantage of GD is that it uses a fixed step size (Gong et al., 2012). This makes the convergence time longer, causing insufficient training of the MLP-ANN. To overcome this limitation, GDM is chosen as the training algorithm in this research because of its ability to provide a variable step size, thus shortening the convergence time (Gong et al., 2012).
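To make the idea of gradient descent with momentum and an adaptive learning rate concrete, here is a minimal sketch on a toy quadratic loss that stands in for the network MSE. It is not the authors' training code: the initial learning rate, the momentum of 0.9, the growth and shrink factors (1.05 and 0.7) and the rule of rejecting any step that increases the loss are all assumed for illustration. Only the goal of 0.001 and the cap of 1,800 epochs come from the text.

```python
import numpy as np

def train_gdm_adaptive(grad_fn, loss_fn, w0, lr=0.01, momentum=0.9,
                       lr_inc=1.05, lr_dec=0.7, max_epochs=1800, goal=1e-3):
    """Gradient descent with momentum and a simple adaptive learning rate.

    A step that lowers (or keeps) the loss is accepted and the rate grows;
    a step that raises the loss is rejected, the rate shrinks and the
    accumulated velocity is reset.
    """
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    loss = loss_fn(w)
    for epoch in range(1, max_epochs + 1):
        if loss <= goal:
            return w, loss, epoch - 1
        velocity = momentum * velocity - lr * grad_fn(w)
        candidate = w + velocity
        new_loss = loss_fn(candidate)
        if new_loss > loss:
            lr *= lr_dec                      # step too large: shrink the rate
            velocity = np.zeros_like(w)       # and discard the momentum
        else:
            w, loss = candidate, new_loss     # accept the step
            lr *= lr_inc                      # and grow the rate slightly
    return w, loss, max_epochs

# Toy problem: mean squared distance to a fixed target vector
target = np.array([1.0, -2.0, 0.5])
loss_fn = lambda w: float(np.mean((w - target) ** 2))
grad_fn = lambda w: 2.0 * (w - target) / target.size

w_final, final_loss, epochs_used = train_gdm_adaptive(grad_fn, loss_fn, np.zeros(3))
```

The momentum term lets consecutive steps in the same direction build up speed, while the adaptive rate keeps the step size from staying fixed, which is the limitation of plain GD noted above.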

1.4.2 Testing and Evaluation

In this research, 70% of the data is used for training and 30% for testing. The accuracy rate is calculated as in Eq. 9 to measure the closeness of the tested data to the actual data.

$$ \text{Accuracy} = \frac{\text{Actual\_data} - \text{tested\_data}}{\text{Actual\_data}} \qquad (9) $$

Another evaluation is done to assess the maximum acceptance of the filled pauses in the ANN model. This is done by calculating an overall performance score from the true positive and false positive rates of the filled pauses. True positive is defined as the rate of correctly accepted filled pauses, while false positive is defined as the rate of incorrectly rejected filled pauses in the ANN. Eq. 10 shows the score measurement.

$$ \text{Score} = \text{True\_positive\_rate} \times 0.7 + (1 - \text{False\_positive\_rate}) \times 0.3 \qquad (10) $$
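The score can be illustrated with a short function, reading Eq. 10 as Score = TP_rate x 0.7 + (1 - FP_rate) x 0.3 with rates expressed as fractions between 0 and 1. That reading of the garbled equation is an assumption, although it reproduces the TP, FP and Score columns of the results tables; the function name is hypothetical.

```python
def performance_score(tp_rate, fp_rate, w_tp=0.7, w_fp=0.3):
    """Weighted score: tp_rate * 0.7 + (1 - fp_rate) * 0.3 (rates as fractions)."""
    return tp_rate * w_tp + (1.0 - fp_rate) * w_fp

# A run with TP = 53% and FP = 47% yields a score of 53%
score = performance_score(0.53, 0.47)
```

Whenever the two rates sum to 1, as they do in every row of the results tables, the score equals the true positive rate.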

2 Results

Table 3 shows the results of filled pause classification using MFCC and Table 4 presents the results of classification using energy-boosted MFCC. The results show that energy-boosted MFCC outperforms classification using MFCC only. Among the three types of filled pauses, the highest accuracy rate is achieved by energy-boosted MFCC for the /AAA/ filled pause, with an average accuracy rate of 77% and a score of 63%. The average energy of the /AAA/ filled pause is much lower than those of /EER/ and /MHM/, as tabulated earlier in Table 1. The integration of MFCC and energy for each filled pause boosts the pattern of the feature vector, thus increasing the discrimination between the patterns of the filled pauses.

Table 3. Performance result of MFCC-based feature

HN   Epoch   Time      MSE        Accuracy (%)        Score (%)
                                  AAA   EER   MHM     TP    FP    Score
7    300     0:00:02   0.029500   60    60    20      53    47    53
7    600     0:06:36   0.009920   60    40    20      40    60    40
7    900     0:12:21   0.008910   60    40    40      47    53    47
7    1200    0:10:03   0.000996   20    40    20      47    53    47
7    1500    0:13:11   0.000997   80    40    40      53    47    53
12   300     0:03:03   0.041100   40    60    20      40    60    40
12   600     0:07:27   0.001750   80    40    60      60    40    60
12   900     0:10:20   0.006560   60    40    40      47    53    47
12   1200    0:16:48   0.000999   80    40    20      47    53    47
24   300     0:03:11   0.029700   40    60    20      40    60    40
24   600     0:08:02   0.018100   60    60    20      47    53    47
24   900     0:12:14   0.003690   40    60    60      53    47    53
24   1200    0:15:13   0.001770   80    40    20      47    53    47
24   1500    0:17:06   0.000999   40    60    40      47    53    47

Table 4. Performance result of Energy-boosted MFCC

HN   Epoch   Time      MSE        Accuracy (%)        Score (%)
                                  AAA   EER   MHM     TP    FP    Score
7    300     0:03:36   0.022000   80    60    40      60    40    60
7    600     0:09:39   0.004800   40    60    40      53    47    53
7    900     0:12:54   0.002190   60    80    40      60    40    60
7    1200    0:11:32   0.000998   80    80    40      67    33    67
12   300     0:03:18   0.030300   60    80    40      60    40    60
12   600     0:06:25   0.001470   100   60    40      67    33    67
12   900     0:09:42   0.003320   100   80    40      73    27    73
12   1200    0:12:37   0.001880   100   60    40      67    33    67
12   1500    0:15:49   0.001100   100   80    40      73    27    73
12   1800    0:12:28   0.001000   80    60    40      60    40    60
24   300     0:03:21   0.015900   80    60    40      67    33    67
24   600     0:06:48   0.001300   60    60    40      53    47    53
24   900     0:10:29   0.000997   60    20    80      53    47    53

The ratio of average accuracy for each pair of filled pause types is calculated as the difference between their average accuracies. The purpose of calculating the ratio is to show the impact of energy-boosted MFCC on the classification of filled pauses. The ratios between all types of filled pauses are summarized in Table 5. The table shows that the ratio becomes larger when the MFCC features of the filled pauses are integrated with energy, thereby improving the classification accuracy of each filled pause type.

Table 5. The ratio of average accuracy for MFCC and energy-boosted MFCC feature

Filled pause       MFCC   MFCC+ENERGY
/AAA/ : /EER/      8      12
/AAA/ : /MHM/      26     34
/EER/ : /MHM/      18     22
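The MFCC+ENERGY column appears to be reproducible as differences between per-class mean accuracies. The sketch below checks this using the 13 per-run accuracy values transcribed from Table 4; treating the "ratio" as a rounded difference of means is an interpretation of the text, not a procedure the paper spells out.

```python
# Per-run accuracies (%) transcribed from Table 4 (energy-boosted MFCC), 13 runs
aaa = [80, 40, 60, 80, 60, 100, 100, 100, 100, 80, 80, 60, 60]
eer = [60, 60, 80, 80, 80, 60, 80, 60, 80, 60, 60, 60, 20]
mhm = [40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 80]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

avg = {"AAA": mean(aaa), "EER": mean(eer), "MHM": mean(mhm)}

# Pairwise differences between the class averages, in percentage points
diffs = {
    "AAA:EER": avg["AAA"] - avg["EER"],
    "AAA:MHM": avg["AAA"] - avg["MHM"],
    "EER:MHM": avg["EER"] - avg["MHM"],
}
rounded = {pair: round(value) for pair, value in diffs.items()}
```

The /AAA/ average rounds to 77%, and the rounded differences come out as 12, 34 and 22, matching the MFCC+ENERGY column of Table 5.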

Fig. 3, Fig. 4 and Fig. 5 graphically show the average performance comparisons of MFCC and energy-boosted MFCC in terms of Mean Square Error (MSE), accuracy and score. MSE is the standard criterion for the assessment of signal quality (Wang and Bovik, 2009). The MSE of energy-boosted MFCC is lower than that of the MFCC-based features, as shown in Fig. 3. Consequently, the ANN classifier produces a higher accuracy rate since the error is smaller, as illustrated in Fig. 4. The score of the ANN performance is also calculated to show the acceptance of filled pauses by the ANN. The energy-boosted MFCC achieves a higher score, as shown in Fig. 5.

Fig.3. The average MSE

Fig.4. The average accuracy

Fig. 5. The average Score

Acknowledgement: The authors gratefully acknowledge the Ministry of Higher Education Malaysia for the Fundamental Research Grant Scheme (FRGS, Grant No: 600-RMI/FRGS 5/3 (48/2013)) and MARA University of Technology for providing research facilities throughout this research.

References

1. A. Veiga, S. Candeias, L. Carla, P. Fernando: Characterization of Hesitations Using Acoustic Models. In: ICPhS XVII, Hong Kong (2011)

2. F. Rosdi, R.N. Ainon: Isolated Malay Speech Recognition Using Hidden Markov Models. In: Proceedings of the International Conference on Computer and Communications Engineering (ICCCE08), Malaysia, pp. 721-725 (2008)

3. F. Stouten: Coping with Disfluencies in Spontaneous Speech Recognition: Acoustic Detection and Linguistic Context Manipulation. Speech Communication 48(11), 1590-1606 (2006)

4. G. Dede, M.H. Sazli: Speech Recognition with Artificial Neural Network. Digital Signal Processing 20(3), 763-768 (2010)

5. G. Garg, N. Ward: Detecting Filled Pauses in Tutorial Dialogs. Report, University of Texas at El Paso, El Paso (2006)

6. G.P. Zhang: Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, pp. 451-462 (2000)

7. I.C. Horia: Towards a Speaker-Independent, Large-Vocabulary Continuous Speech Recognition System for Romanian. PhD Thesis, University of Politehnica din Bucureşti (2011)

8. K. Audhkhasi: Formant-Based Technique for Automatic Filled-Pause Detection in Spontaneous Spoken English. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2009)

9. L. Gong, C. Liu, Y. Li, F. Yuan: Training Feed-forward Neural Networks Using the Gradient Descent Method with the Optimal Stepsize. Journal of Computational Information Systems 8(4), 1359-1371 (2012)

10. M. Kaushik: Automatic Detection and Removal of Disfluencies from Spontaneous Speech. In: Proc. 13th Australasian Int. Conf. on Speech Science and Technology, Melbourne, pp. 98-101 (2010)

11. S.A. Majeed, H. Husain, S.A. Samad, A. Hussain: Hierarchical K-Means Algorithm Applied on Isolated Malay Digit Speech Recognition. In: International Conference on System Engineering and Modeling (ICSEM 2012), IPCSIT vol. 34, IACSIT Press, Singapore (2012)

12. S. Karsoliya: Approximating Number of Hidden Layer Neurons in Multiple Hidden Layer BPNN Architecture. International Journal of Engineering Trends and Technology, vol. 3 (2012)

13. Seman: Coalition of Artificial Intelligent (AI) Algorithms for Isolated Spoken Malay Speech Recognition. PhD Thesis, Universiti Teknologi Mara, Shah Alam (2012)

14. Y. Hu: Detecting Non-Speech in Dysarthric Speech. Master's Thesis, University of Sheffield (2009)

15. Z. Wang, A.C. Bovik: Mean Square Error: Love It or Leave It? IEEE Signal Processing Magazine (2009)

16. P. Mahesha, D.S. Vinod: Feature Based Classification of Dysfluent and Normal Speech. In: Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology (CCSEIT '12), ACM, New York, NY, USA, pp. 594-597 (2012)
