Continuous Telugu Speech Recognition through Combined Feature Extraction by MFCC and DWPD Using HMM based DNN Techniques

1 Archek Praveen Kumar, 2 Ratnadeep Roy, 3 Sanyog Rawat and 4 Prathibha Sudhakaran
1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]

Abstract
ASR (automatic speech recognition) deals with the recognition of human speech for proper and reliable communication between human and machine. Speech recognition systems fall into two classes: speaker dependent and speaker independent. Speaker-dependent recognition systems are used in biometric applications to identify a particular user; speaker-independent systems are used to convert speech to text. This research presents a speaker-independent system for recognizing continuous words in Telugu, a language in which emotion and accent vary widely. Continuous speech recognition is performed with combined MFCC (Mel Frequency Cepstral Coefficients) and DWPD (Discrete Wavelet Packet Decomposition) feature extraction. The DWT (Discrete Wavelet Transformation) is used for de-noising and pre-processing of the recorded speech. The extracted features are classified by the highly efficient HMM (Hidden Markov Model) based DNN (Deep Neural Network) technique; the DNN is an advance over the ANN that gives greater recognition accuracy. A word-model-based approach is used for recognizing the continuous data. The simulation uses a word-based, context-independent acoustic model with 446 uncommon words.

Key Words: Continuous speech recognition, Telugu language, MFCC, DWPD, DWT, HMM, DNN.

International Journal of Pure and Applied Mathematics, Volume 114, No. 11 (2017), 187-197
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue



1. Introduction

Speech is a one-dimensional signal, and speech processing deals with many factors such as frequency and cepstral parameters. Speech recognition systems are mainly classified into speaker-dependent and speaker-independent systems [1]. Speaker-dependent systems identify a particular speaker and are used in security and biometric applications. Speaker-independent systems convert speech to text and are especially used in audio subtitling applications. Research in speech recognition addresses alphabets, isolated words and continuous words [2]. Many researchers have worked on the Telugu language over the past few years, but on recognition of either alphabets or isolated words; continuous words are the key area to consider in designing a complete recognition system. Speech recognition follows three basic approaches: word based, phone based and syllable based. This paper takes the word-based approach, with continuous words drawn from a large vocabulary. Designing a continuous-speech, word-based system with high accuracy is a challenging task. Speech recognition is a popular research area in which many European and Russian languages have been covered, and it is well developed for Asian languages such as Chinese, but speech recognition for Indian languages is still in its beginning phase. There is particularly wide scope for research in the south Indian language Telugu. A few researchers have attempted to recognize Telugu words with different feature extraction and classification techniques, achieving different recognition accuracies.

Archek et al. attempted Telugu speech recognition by combining MFCC and arithmetic coding using an ANN, with promising recognition accuracy [3]. Sonia Sunny et al. worked on wavelet-based feature extraction for isolated words in the Malayalam language with esteemed results [4], and then on recognition of the same isolated words with a different classifier, Naive Bayes [5]. Archek et al. also worked on Telugu speech recognition with compression of the data by adaptive Huffman coding, and showed that recognition can also be achieved with the Vitter algorithm [6]. Funded research on Malayalam speech recognition compared all the common classifiers and found the ANN better than SVM and Naive Bayes [7]. Rajesh et al. used combined feature extraction with MFCC and MODGDF for continuous words, with good accuracy of around 86 to 87% [8]. Similarly, another paper combined MFCC and MODGDF features but classified them with Naive Bayes, where the recognition accuracy was quite high at 93.76% [9]. A government-funded project by Prof. Thangarajan and group applied word-based and triphone-based approaches to continuous-speech recognition of the Tamil language, with challenging results across factors such as accuracy, sentences, time, speed and memory [10]. Venkateswarlu et al. developed a system to recognize Telugu speech, but it is based on a letter-level approach [11]. Lastly, incremental maximum a posteriori estimation of HMMs was designed by Gotoh and Hochberg [12]. Reviewing all these papers, the research here takes a word-based approach to continuous words by combining the MFCC and DWPD techniques with the highly recommended deep neural network (DNN) based on an HMM.

2. Block Description

The overall speech recognition flow is shown in figure 1. First the speech database is collected, with continuous speech recorded. The recorded speech is preprocessed (also called front-end analysis), where sampling, windowing and softening are done. The preprocessed signal is then simulated to extract features using the joint MFCC and DWPD techniques. Lastly, the extracted features are classified using the HMM-based DNN technique.

Fig. 1: Block diagram

A. Telugu Language

Telugu is a famous language of south India, evolved from the Dravidian language family. Telugu has 60 letters, and every utterance in Telugu ends with a vowel. The pronunciation of Telugu is quite distinctive, with its letters grouped into categories called Acchulu (vowels) and Hallulu (consonants). Telugu is also recognized as a classical language of India and is the official language of the states of Andhra Pradesh and Telangana. Telugu-to-English conversion is done in the IPA format. The Telugu language has 16 vowels, 3 vowel modifiers and 41 consonants [13].

B. Speech Database

The speech database is created by recording the speech signal in a silent environment, since speech signals are easily affected by noise. Continuous speech covering 446 uncommon words is recorded from 20 speakers at a sampling rate of 16 kHz. The speech is recorded in wave files, which are later converted to mono. The microphone should be chosen so that it converts the pressure variations of speech to electrical signals faithfully.
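The stereo-to-mono conversion mentioned above amounts to averaging the two channels per sample. A minimal sketch (the function name and the synthetic 16 kHz signal are illustrative, not from the paper):

```python
import numpy as np

def stereo_to_mono(stereo):
    """Average the two channels of an (n_samples, 2) stereo signal."""
    stereo = np.asarray(stereo, dtype=np.float64)
    return stereo.mean(axis=1)

# Hypothetical 16 kHz stereo recording: two slightly different channels.
fs = 16000
t = np.arange(fs) / fs
left = np.sin(2 * np.pi * 440 * t)
right = 0.8 * np.sin(2 * np.pi * 440 * t)
mono = stereo_to_mono(np.stack([left, right], axis=1))
print(mono.shape)  # (16000,)
```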

C. Preprocessing or Front end of Processing

Front-end analysis is important for every speech recognition system. Since the recorded speech signal is a mixture of speech and noise, a few preprocessing steps are followed, as shown in figure 2.

Fig. 2: Preprocessing flow diagram (segmentation, frame blocking, Hamming windowing, time alignment, 8-level DWT for de-noising)

The DWT (discrete wavelet transformation) is an appropriate de-noising technique used by many researchers. Hard thresholding and soft thresholding are the two techniques that can be used with the DWT; this research uses soft thresholding with 8 levels, which removes almost all of the noise. Preprocessing also includes segmentation with framing and windowing, using the Hamming windowing technique.
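The multi-level soft-threshold de-noising can be sketched with a hand-rolled Haar DWT. This is a simplification: the paper does not state the mother wavelet or the threshold value, so the Haar filters and the fixed threshold below are assumptions.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    a = (x[0::2] + x[1::2]) / SQRT2
    d = (x[0::2] - x[1::2]) / SQRT2
    return a, d

def haar_idwt(a, d):
    """Inverse of one Haar DWT level."""
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / SQRT2
    x[1::2] = (a - d) / SQRT2
    return x

def soft_threshold(c, t):
    """Shrink coefficients toward zero by t (soft thresholding)."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def denoise(x, levels=8, t=0.05):
    """Multi-level Haar de-noising; len(x) must be divisible by 2**levels."""
    details, a = [], x
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(soft_threshold(d, t))
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(2048) / 64)
noisy = clean + 0.1 * rng.standard_normal(2048)
out = denoise(noisy, levels=8, t=0.05)
```

With `t=0` the transform reconstructs the input exactly (Haar is orthogonal); a positive threshold suppresses small detail coefficients, which carry most of the noise.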

D. Feature extraction

Feature extraction is a major part of speech recognition. MFCC and DWPD are the most important techniques used here for feature extraction; MFCC works in the cepstral domain. Feature vectors are extracted from these techniques both individually and jointly. Joint feature extraction is an interesting area because more features are extracted than with the individual techniques.

MFCC

MFCC stands for Mel Frequency Cepstral Coefficients. The short-time power spectrum of the speech is calculated, and 12 coefficients are determined in the cepstral domain. Mel is a scale on which bands of frequencies are spaced perceptually equally: the frequency bands of a normal cepstrum are linearly spaced, whereas the Mel scale has equally spaced bands. This warping of frequencies is useful for extracting features. The detailed process of MFCC is shown in figure 3. The speech signal is transformed using the Fast Fourier Transform (FFT) [14], and the frequency spectrum is Mel-scaled and filtered using a Mel-scale filter bank. The log of the Mel-filtered spectrum is calculated and the DCT (Discrete Cosine Transformation) is applied. The DCT yields the discrete coefficients, and the derivatives of the DCT coefficients give the vectors, also called features [15] [16].

Fig. 3: MFCC block diagram
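The FFT → Mel filtering → log → DCT pipeline can be sketched compactly using the parameters given in Table 1 (256-point FFT, 30 filters, a 180–3650 Hz band, 12 cepstra). The helper names and the hand-rolled DCT-II below are illustrative, not the paper's MATLAB code.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_lo, f_hi):
    """Triangular filters with centers equally spaced on the Mel scale."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def dct2(x, n_out):
    """DCT-II of x, returning the first n_out coefficients."""
    n = x.size
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return np.sum(x[None, :] * basis, axis=1)

def mfcc_frame(frame, fs=16000, n_fft=256, n_filters=30, n_ceps=12):
    """MFCC of one frame: windowed FFT -> Mel filtering -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(frame.size), n_fft)) ** 2
    fb = mel_filterbank(n_filters, n_fft, fs, 180.0, 3650.0)
    logmel = np.log(fb @ spec + 1e-10)
    return dct2(logmel, n_ceps)
```

Calling `mfcc_frame` on a 256-sample frame yields the 12-dimensional cepstral vector for that frame; stacking the per-frame vectors gives the MFCC feature matrix.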

DWPD

Discrete wavelet packet decomposition integrates the discrete wavelet transformation (DWT) and wavelet packet decomposition (WPD). Wavelets are a highly useful technique for speech recognition, both for de-noising and for feature extraction. A wavelet transform separates high-frequency and low-frequency sub-bands: the low-frequency band carries the features or characteristics of the signal, while the high-frequency band largely carries the noise content. The DWT decomposes only the low-frequency component further, up to a maximum number of levels, taken as 8 in this research. The high-frequency component sometimes contains both noise and information, which is a big disadvantage of applying the DWT alone; WPD overcomes this, but increases the computational complexity. DWPD combines DWT and WPD in a way that limits the complexity and enhances the speech for feature extraction. Detailed views of DWT, WPD and DWPD are shown in figure 4, from left to right.

Fig. 4: DWT, WPD and DWPD tree diagram

The DWPD process first splits the speech signal into low-frequency and high-frequency components in the first level of decomposition. The remaining 7 scales of DWT are then computed on the low-frequency component, and WPD is applied to the high-frequency component. The features obtained from the DWT and the WPD are combined, and the maximal sets of feature vectors are generated.
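The hybrid split just described can be sketched as follows, again assuming Haar filters: a first-level split, then a DWT chain (split only the low band) on the low-frequency branch and a full packet tree (split every band) on the high-frequency branch. The depth of 3 in the demo is for brevity; the paper uses 8 levels, which requires the signal length to be divisible by 2^8.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_split(x):
    """One Haar analysis step: low-pass and high-pass half-band outputs."""
    return (x[0::2] + x[1::2]) / SQRT2, (x[0::2] - x[1::2]) / SQRT2

def dwt_chain(x, levels):
    """Plain DWT: keep splitting only the low-frequency band."""
    coeffs, a = [], x
    for _ in range(levels):
        a, d = haar_split(a)
        coeffs.append(d)          # detail band at this scale
    coeffs.append(a)              # final approximation band
    return coeffs

def wpd_tree(x, levels):
    """Wavelet packet decomposition: split every band at every level."""
    bands = [x]
    for _ in range(levels):
        bands = [half for b in bands for half in haar_split(b)]
    return bands

def dwpd(x, levels=3):
    """First-level split, then DWT on the low band and WPD on the high band."""
    low, high = haar_split(x)
    return dwt_chain(low, levels - 1), wpd_tree(high, levels - 1)

low_coeffs, high_bands = dwpd(np.arange(16.0), levels=3)
```

Because the transform only rearranges samples between bands, the total coefficient count equals the input length, and the concatenated band statistics form the DWPD feature vector.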

International Journal of Pure and Applied Mathematics Special Issue

191

The joint features from MFCC and DWPD are generated, and the feature vector sets are stored for classification. The more numerous and efficient the extracted features, the higher the recognition accuracy.
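Joining the two feature streams per frame is a simple concatenation of the per-frame vectors; the frame count and per-stream dimensions below are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
mfcc_feats = rng.standard_normal((100, 12))   # 100 frames x 12 cepstra
dwpd_feats = rng.standard_normal((100, 16))   # 100 frames x 16 sub-band features

# One joint feature vector per frame: MFCC and DWPD columns side by side.
joint = np.hstack([mfcc_feats, dwpd_feats])
print(joint.shape)  # (100, 28)
```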

E. Feature Classification

The extracted features are classified for recognition. There are numerous classifiers used for speech recognition; classification can be done by pattern recognition, vector analysis or artificial neural networks. This paper uses an advanced version of the ANN, the deep neural network, based on an HMM [17]. A speech recognition system involves statistical measures: conditional probabilities from the model are used to form posterior probabilities. The HMM normalizes the speech signals with respect to time, which helps characterize the conditional probabilities. This paper designs an HMM-based DNN system for continuous Telugu speech, using a context-independent (CI) word-model approach in which every word is mapped accordingly. Repeated words have different pronunciations, so each word is indexed properly for further processing. Classification with deep neural networks is done in two phases, a training phase and a testing phase; 40% of the collected data is used for training and 60% for testing. ANN-based HMMs have been used for such designs, but the recent approach is the DNN-based HMM, which improves recognition accuracy by solving LVCSR problems [18].

The DNN uses a deeper, more expressive neural network framework with the DBN (deep belief network) algorithm, which makes the training data more suitable. The DNN also uses posterior statistical probabilities according to the word-based approach. The DNN-based HMM is a hybrid architecture over senones and is easily suited to continuous speech; training with senones allows more data to be represented in the training network. The detailed DNN-based HMM is shown in figure 5. An LVCSR system is considered to build the word-based architecture.

Fig. 5: HMM based DNN architecture
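The hybrid HMM/DNN decoder can be sketched as a Viterbi search over the HMM transitions, with the DNN's state posteriors converted to scaled likelihoods by dividing out the state priors. The 3-state left-to-right word model and the posterior matrix below are toy values, not the paper's.

```python
import numpy as np

def viterbi(log_trans, log_obs, log_init):
    """Most likely HMM state sequence given per-frame log observation scores."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans              # (from_state, to_state)
        back[t] = np.argmax(scores, axis=0)              # best predecessor
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                        # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

eps = 1e-10
# Hypothetical 3-state left-to-right word HMM.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
prior = np.array([1/3, 1/3, 1/3])
# Hypothetical DNN state posteriors for 5 frames.
post = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.2, 0.7],
                 [0.1, 0.1, 0.8]])
# Hybrid trick: scaled likelihood = posterior / state prior.
log_obs = np.log(post / prior + eps)
path = viterbi(np.log(trans + eps), log_obs, np.log(np.array([1.0, eps, eps])))
print(path)  # [0, 0, 1, 2, 2]
```

Word-level decoding then amounts to running this search over the concatenated state graphs of the word models and reading off the word whose states the best path traverses.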

3. Proposed Algorithm

An algorithm is proposed combining all the techniques above; the flow of the process is shown in figure 6. This overall process is used by many researchers for recognition; only the techniques vary.


Fig. 6: Proposed algorithm

4. Results and Discussion

Continuous speech is recorded and a database is created. The continuous speech consists of 600 sentences in which 446 uncommon words are considered. The recording is done in a noise-free environment. Around 24 hours of continuous speech read by 3 male and 3 female speakers forms the training database, and a database from 7 male and 7 female speakers is created for testing, for 20 speakers overall. The data is preprocessed by de-noising, framing, filtering, amplification and windowing as discussed before; the detailed preprocessing parameters are shown in table 1.

Table 1: Signal Preprocessing parameters

Signal feature               Design
Recorded format              .wav format
Sampling rate                16,000 Hz
Bit rate                     64 bits/sec and 128 bits/sec
Mode (mono or stereo)        Mono
Length of Hamming window     0.01285
FFT points                   256
Number of filters used       30
Lowest frequency recorded    180 Hz
Highest frequency recorded   3650 Hz
Count of cepstra             12

Feature extraction is the next part: the individual MFCC and DWPD features are extracted, and the joint MFCC and DWPD features are extracted in MATLAB, as shown in figure 7. The joint MFCC and DWPD features prove more than 90% quality, compared with the individual features. Features are extracted at 64 bits/sec and 128 bits/sec, as shown in table 2. LSP, PPF, CBI, gain, synchronization and FEC are the features extracted.

Fig. 7: Joint Feature extraction by MFCC and DWPD

Table 2: Various features extracted

Feature                        Bits (64 bits/sec)   Bits (128 bits/sec)
Line spectrum pairs (LSP)      11                   24
Pitch prediction filter (PPF)  15                   36
Code base indexes (CBI)        18                   21
Gain                           18                   44
Synchronization                1                    1
FEC                            1                    2
Total                          64                   128

The extracted features are classified using the HMM-based DNN. In the word-based approach the main parameters to consider are the WER and the word accuracy. The continuous speech is tested as shown in table 3, where the recognition accuracy is quite impressive.

Table 3: Recognition accuracy parameters

Parameter              Value
Words spoken           446
Word errors            82 (Sub: 52, Ins: 5, Del: 25)
Recognition accuracy   91.89 %
Sentences recorded     600
Time                   Audio: 72.86 sec; Processing: 185.92 sec
Speed                  2.5 × real time
Memory                 Average: 25.68 MB; Max: 30 MB
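Word error counts of the kind reported in Table 3 (substitutions, insertions, deletions) come from an edit-distance alignment of the reference and hypothesis transcripts. A minimal sketch, with illustrative word lists (the transliterated Telugu words are made up for the example):

```python
import numpy as np

def wer_counts(ref, hyp):
    """Substitutions, insertions, deletions via edit-distance alignment."""
    R, H = len(ref), len(hyp)
    d = np.zeros((R + 1, H + 1), dtype=int)
    d[:, 0] = np.arange(R + 1)
    d[0, :] = np.arange(H + 1)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    # Backtrack to attribute each error to S, I or D.
    i, j, S, I, D = R, H, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and d[i, j] == d[i, j - 1] + 1:
            I, j = I + 1, j - 1
        else:
            D, i = D + 1, i - 1
    return S, I, D

ref = "naa peru telugu lo cheppandi".split()
hyp = "naa peru telugu cheppandi ippudu".split()
S, I, D = wer_counts(ref, hyp)
print(S, I, D, (S + I + D) / len(ref))
```

WER is then (S + I + D) / N over the reference word count, and word accuracy is its complement.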

5. Conclusion

The speech recognition task has been discussed, with the approaches and techniques suitable for generating high accuracy, along with proper results. The speech is recorded and preprocessed, the joint MFCC and DWPD features are extracted, and the resulting vector sets are used for classification. The HMM-based DNN technique with a word-based approach yields high efficiency with very low word error rates. This Telugu speech recognition system proves itself in both quality and technique. In further work, researchers can replace the word-based approach with a triphone-based approach, and can work toward a system with still greater accuracy.


Acknowledgment

The authors first thank Almighty God for the opportunity to do research in speech recognition, especially in the Telugu language. The authors thank Amity University Rajasthan for its support in providing a separate lab for speech processing and recognition, and Dr. Sonia Sunny for guidance in feature extraction and de-noising. Special thanks to the colleagues of the ECE department, Amity University Rajasthan, and, last but not least, to all family members, friends and foes.

References

[1] Ramesh P. babu, Digital Signal Processing, Fourth edition. SciTech Publications (2003).

[2] Yadav, Ashwani Kumar, Ratnadeep Roy, Raj Kumar, Cheruku Sandesh Kumar, Archek Praveen Kumar, Algorithm for De-Noising of Color Images based on Median Filter, Third International Conference on Image Information Processing (ICIIP) (2015), 428-432.

[3] Kumar, Archek Praveen, Neeraj Kumar, Cheruku Sandesh Kumar, Ashwani Kumar Yadav, Abhay Sharma, Speech Recognition using Arithmetic Coding and MFCC for Telugu Language, 3rd International Conference on Computing for Sustainable Global Development (2016), 265-268.

[4] Sunny S., Peter D., Jacob K, A comparative study of wavelet based feature extraction techniques in recognizing isolated spoken words, International Journal of signal processing systems 7(10) (2013).

[5] Sunny S., Peter D., Jacob K, Combined feature extraction techniques and Naïve Bayes classifier for speech, Computer Science and Information Technology, Academy & Industry Research Collaboration Centre 3 (2013).

[6] Kumar A.P., Kumar N., Kumar C.S., Yadav A.K, Speech compression by adaptive Huffman coding using Vitter algorithm, International Journal of Innovative Sciences 2(5) (2015).

[7] Sunny S., Peter D., Jacob K, Performance of different classifiers in speech recognition, International Journal of Research and Engineering Technology 2(4) (2013).

[8] Rajesh M., Hema A., Rao V.R, Continuous speech recognition using joint features derived from modified group delay function and MFCC, Semantic Scholar (2012).

[9] Kumar A.P., Roy R., Rawat S., Sharma A, Telugu speech feature extraction by MODGDF and MFCC using Naïve Bayes classifier, International Journal of Control Theory and Applications 9(21) (2016).

[10] Thangarajan R., Natarajan A.M., Selvam M, Word and triphone based approaches in continuous speech recognition for Tamil language, WSEAS transactions on signal processing 4(3) (2008), 76-86.

[11] Venkateswarlu R.L.K., Ravi Teja R., Vasantha Kumari R, Developing efficient speech recognition system for Telugu letter recognition, International Conference on Computing, Communication and Applications (ICCCA) (2012), 1-6.

[12] Gotoh, Yoshihiko, Michael M. Hochberg, Daniel J. Mashao, Harvey F. Silverman, Incremental MAP Estimation of HMMs for Efficient Training and Improved Performance, International Conference on Acoustics, Speech, and Signal Processing 1 (1995), 457-460.

[13] Kalyani N., Sunitha K.V.N, Syllable analysis to build a dictation system in Telugu language, International journal of computer science and information technology 6(3) (2009).

[14] Dalmiya C.P., Dharun V.S., Rajesh K.P, An efficient method for Tamil speech recognition using MFCC and DTW for mobile applications, IEEE Conference on Information and Communication Technologies (2013).

[15] Singh, Satyanand, E.G. Rajan, Vector Quantization Approach for Speaker Recognition using MFCC and Inverted MFCC, International Journal of Computer Applications 17(1) (2011), 0975-8887.

[16] Hossan, Md Afzal, Sheeraz Memon, Mark A, Gregory, A Novel Approach for MFCC Feature Extraction, 4th International Conference on Signal Processing and Communication Systems (ICSPCS) (2010), 1-5.

[17] Hornik, Kurt, Maxwell Stinchcombe, Halbert White, Multilayer Feedforward Networks are Universal Approximators, Neural Networks 2(5) (1989), 359-366.

[18] Dahl, George E., Dong Yu, Li Deng, Alex Acero, Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing 20(1) (2012), 30-42.
