Continuous Telugu Speech Recognition through Combined Feature Extraction by MFCC and DWPD Using HMM-based DNN Techniques

Archek Praveen Kumar, Ratnadeep Roy, Sanyog Rawat and Prathibha Sudhakaran
Abstract: ASR (automatic speech recognition) deals with the recognition of human speech for reliable communication between human and machine. Speech recognition operates in two modes: speaker dependent and speaker independent. Speaker-dependent systems are used in biometric applications to identify a particular user, while speaker-independent systems convert speech to text. This research presents a speaker-independent system that recognizes continuous words in the Telugu language, in which emotion and accent vary widely. Continuous speech recognition is performed with combined MFCC (Mel Frequency Cepstral Coefficients) and DWPD (Discrete Wavelet Packet Decomposition) features. The DWT (Discrete Wavelet Transform) is used for de-noising and pre-processing of the recorded speech. The extracted features are classified by a highly efficient HMM (Hidden Markov Model) based DNN (Deep Neural Network), an advance over the ANN that gives greater recognition accuracy. A word-model-based approach is used to recognize the continuous data. The simulation uses a word-based, context-independent acoustic model with 446 uncommon words.

Key Words: Continuous speech recognition, Telugu language, MFCC, DWPD, DWT, HMM, DNN.
International Journal of Pure and Applied Mathematics, Volume 114, No. 11 (2017), 187-197. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu (Special Issue)
1. Introduction
Speech is a one-dimensional signal, and speech processing deals with many factors such as frequency and cepstral parameters. Speech recognition systems are mainly classified into speaker-dependent and speaker-independent systems [1]. Speaker-dependent systems identify a particular speaker and are used in security and biometric applications. Speaker-independent systems convert speech to text and are used especially in audio subtitling applications. Speech research addresses recognition of a language at the level of alphabets, isolated words and continuous words [2]. Many researchers have worked on the Telugu language over the past few years, but on recognition of either alphabets or isolated words; continuous words are the key area to be considered in designing a complete recognition system. Speech recognition follows three basic approaches: word based, phone based and syllable based. This paper uses the word-based approach on continuous speech with a large vocabulary. Designing a continuous-speech, word-based system with high accuracy is a challenging task. Speech recognition is a very active research area in which many European and Russian languages have been recognized, and systems are well developed for Asian languages such as Chinese, but speech recognition for Indian languages is still in its beginning phase. In particular, there is ample scope for research on the south Indian language Telugu. A few researchers have attempted to recognize Telugu words with different feature extraction and classification techniques, achieving different recognition accuracies.
Archek et al. attempted Telugu speech recognition by combining MFCC and arithmetic coding using an ANN, with promising recognition accuracy [3]. Similarly, Sonia Sunny et al. applied a wavelet-based feature extraction technique to isolated words in the Malayalam language, with strong results [4]. The same authors addressed Malayalam recognition of the same isolated words with a different classifier, Naive Bayes, for better recognition [5]. Archek et al. also worked on Telugu speech recognition with data compression by adaptive Huffman coding, and showed that recognition can also be achieved with the Vitter algorithm [6]. Funded research on Malayalam speech recognition compared the main classifiers and concluded that the ANN performed better than SVM and Naive Bayes [7]. Rajesh et al. used combined feature extraction by MFCC and MODGDF on continuous words, with good accuracy of around 86 to 87% [8]. Another paper likewise combined MFCC and MODGDF features but used a Naive Bayes classifier, with quite high recognition accuracy of 93.76% [9]. A government-funded project by Prof. Thangarajan and group investigated word-based and triphone-based approaches for continuous Tamil speech, reporting results across factors such as accuracy, sentences, time, speed and memory [10]. Venkateswarlu et al. developed a system to recognize Telugu speech, but it is based on a letter approach [11]. Lastly, incremental maximum a posteriori estimation of HMMs was designed by Gotoh and Hochberg [12]. Reviewing all these papers, the research here adopts a word-based approach for continuous words, combining MFCC and DWPD features with a highly recommended deep neural network (DNN) based on the HMM.
2. Block Description
Speech recognition follows the flow detailed in figure 1. First the speech database is collected, where continuous speech is recorded. The recorded speech is preprocessed (also called front-end analysis), where sampling, windowing and smoothing are done. The preprocessed signal is then processed to extract features using joint MFCC and DWPD techniques. Lastly, the extracted features are classified using an HMM-based DNN.
Fig. 1: Block diagram
A. Telugu Language
Telugu is one of the most widely spoken languages in south India and evolved from the Dravidian language family. The Telugu alphabet has 60 letters, and every utterance in Telugu ends with a vowel. Telugu pronunciation varies considerably by dialect, and the script distinguishes vowels (Acchulu) from consonants (Hallulu). Telugu is also recognized as a classical language of India and is the official language of the states of Andhra Pradesh and Telangana. Telugu-to-English conversion is done via the IPA format. The Telugu language has 16 vowels, 3 vowel modifiers and 41 consonants [13].
B. Speech Database
The speech database is created by recording speech in a silent environment, since speech signals are easily affected by noise. Continuous speech covering 446 uncommon words is recorded by 20 speakers at a sampling rate of 16 kHz. Speech is recorded as wave files, which are later converted to mono. The microphone must faithfully convert the pressure variations of speech into electrical signals.
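The mono-conversion step can be sketched as follows. This is an illustrative Python/NumPy sketch (the paper's experiments use MATLAB), with a synthetic 440 Hz tone standing in for a recorded wave file:

```python
import numpy as np

def stereo_to_mono(samples: np.ndarray) -> np.ndarray:
    """Average the two channels of an (n, 2) int16 stereo signal into mono."""
    if samples.ndim == 1:
        return samples  # already mono
    mono = samples.astype(np.float32).mean(axis=1)
    return mono.astype(np.int16)

# Example: 1 second of a 440 Hz tone sampled at 16 kHz on both channels.
fs = 16_000
t = np.arange(fs) / fs
tone = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
stereo = np.stack([tone, tone], axis=1)   # shape (16000, 2)
mono = stereo_to_mono(stereo)             # shape (16000,)
```

With identical channels the average reproduces the original tone exactly; with a real stereo recording it mixes the two microphone channels.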
C. Preprocessing or Front-end Processing
Front-end analysis is important for every speech recognition system. Since the recorded signal is a combination of speech and noise, a few preprocessing steps are followed, as shown in figure 2.
Fig. 2: Pre Processing flow diagram
The DWT (discrete wavelet transform) is an appropriate technique for de-noising, used by many researchers. Hard thresholding and soft thresholding are the two techniques that can be used with the DWT. This research uses soft thresholding with 8 levels, which removes almost all of the noise. Preprocessing also includes segmentation, framing and windowing, using the Hamming windowing technique.
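The two thresholding rules can be stated compactly. The sketch below (Python/NumPy, illustrative only) applies them to a toy coefficient vector; in the actual pipeline they would be applied to the wavelet coefficients at each of the 8 DWT levels, and the decomposition itself is omitted here:

```python
import numpy as np

def soft_threshold(coeffs: np.ndarray, t: float) -> np.ndarray:
    """Soft thresholding: shrink every coefficient toward zero by t,
    zeroing anything whose magnitude is below t."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

def hard_threshold(coeffs: np.ndarray, t: float) -> np.ndarray:
    """Hard thresholding: keep coefficients above t unchanged, zero the rest."""
    return np.where(np.abs(coeffs) > t, coeffs, 0.0)

c = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
soft = soft_threshold(c, 1.0)   # [-2., -0., 0., 0.5, 3.]
hard = hard_threshold(c, 1.0)   # [-3., 0., 0., 1.5, 4.]
```

Soft thresholding is generally preferred for de-noising because it avoids the discontinuity at the threshold that hard thresholding introduces.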
D. Feature extraction
Feature extraction is a major part of speech recognition. MFCC and WPD are among the most important feature extraction techniques; MFCC works in the cepstral domain. Feature vectors are extracted from these techniques both individually and jointly. Joint feature extraction is of particular interest because more features are extracted than with the individual techniques.
MFCC
MFCC stands for Mel frequency cepstral coefficients. The short-time power spectrum of the speech is calculated, and 12 coefficients are determined in the cepstral domain. Mel is a scale on which bands of frequencies are spaced equally: normal spectrum frequency bands are linearly spaced, but mel bands are equally spaced perceptually. This warping of frequencies is useful for extracting features. The detailed MFCC process is shown in figure 3. The speech signal is transformed using the Fast Fourier Transform (FFT) [14]. The frequency spectrum is mel-scaled and filtered by a mel-scale filterbank. The log of the mel-filtered spectrum is calculated and the DCT (discrete cosine transform) is applied, yielding the discrete coefficients. The DCT coefficients and their derivatives form the vectors that are also called features [15], [16].
Fig. 3: MFCC block diagram
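As a rough illustration of this pipeline, the NumPy sketch below computes 12 coefficients from a single frame using the parameters later listed in table 1 (30 filters, 256-point FFT, 180-3650 Hz band). The frame length and filterbank construction are simplifying assumptions, not the authors' exact implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=30, n_fft=256, fs=16_000, f_lo=180.0, f_hi=3650.0):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft//2+1)."""
    mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising edge of triangle
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling edge of triangle
    return fb

def mfcc(frame, n_coeffs=12, n_fft=256, fs=16_000):
    """12 MFCCs of one frame: window -> |FFT|^2 -> mel filter -> log -> DCT-II."""
    frame = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    mel_energies = mel_filterbank(n_fft=n_fft, fs=fs) @ power
    log_mel = np.log(mel_energies + 1e-10)
    n = len(log_mel)
    # DCT-II basis: X_k = sum_n x_n cos(pi/N * (n + 0.5) * k)
    basis = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n)[None, :])
    return (log_mel @ basis)[:n_coeffs]

coeffs = mfcc(np.sin(2 * np.pi * 300 * np.arange(200) / 16000.0))
```

The 1e-10 floor avoids log(0) for filters whose band contains no energy.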
DWPD
Discrete wavelet packet decomposition integrates the discrete wavelet transform (DWT) and wavelet packet decomposition (WPD). Wavelets are a highly useful technique for speech recognition in terms of de-noising and feature extraction. A wavelet transform separates high-frequency and low-frequency sub-bands: the low-frequency band carries the features or characteristics of the signal, and the high-frequency band mostly carries noise. The DWT decomposes only the low-frequency component down to a maximum number of levels, taken here as 8. Because the high-frequency component sometimes contains information as well as noise, applying the DWT alone is a serious disadvantage; WPD overcomes it but increases the computational complexity. DWPD combines DWT and WPD, limiting the complexity while enhancing the speech for feature extraction. The detailed view of DWT, WPD and DWPD is shown in figure 4, from left to right.
Fig. 4: DWT, WPD and DWPD tree diagram
The DWPD process first splits the speech signal into low-frequency and high-frequency components in the first level of decomposition. The remaining 7 scales of DWT are then computed on the low-frequency component, and WPD on the high-frequency component. The features obtained by the DWT and WPD are combined, generating the maximal sets of feature vectors.
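A minimal sketch of this scheme follows, using Haar filters and 3 levels instead of the paper's 8 for brevity; the filter choice and the per-band log-energy features are illustrative assumptions, not the authors' exact design:

```python
import numpy as np

def haar_split(x):
    """One-level orthonormal Haar analysis: low-pass and high-pass halves."""
    x = x[: len(x) // 2 * 2]
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return lo, hi

def dwt_leaves(x, levels):
    """Plain DWT: keep each detail band, keep decomposing only the approximation."""
    bands = []
    for _ in range(levels):
        x, hi = haar_split(x)
        bands.append(hi)
    bands.append(x)
    return bands

def wpd_leaves(x, levels):
    """Full wavelet packet tree: decompose BOTH halves at every level."""
    nodes = [x]
    for _ in range(levels):
        nodes = [half for n in nodes for half in haar_split(n)]
    return nodes

def dwpd_features(x, levels=3):
    """Split once, then DWT on the low band and WPD on the high band;
    the feature vector is the log energy of every leaf band."""
    lo, hi = haar_split(x)
    leaves = dwt_leaves(lo, levels) + wpd_leaves(hi, levels)
    return np.array([np.log(np.sum(b ** 2) + 1e-10) for b in leaves])

rng = np.random.default_rng(1)
signal = rng.normal(size=256)
feats = dwpd_features(signal)   # (levels + 1) + 2**levels = 12 leaf energies
```

Because the Haar split is orthonormal, the leaf-band energies sum to the energy of the input signal, which is a convenient sanity check on any implementation.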
Joint features from MFCC and DWPD are generated, and the feature vector sets are stored for classification. The more numerous and discriminative the extracted features, the higher the recognition accuracy.
E. Feature Classification
The extracted features are classified for recognition. There are numerous classifiers used for speech recognition; classification can be done by pattern recognition, vector analysis or artificial neural networks. This paper uses an advanced version of the ANN, the deep neural network, based on the HMM [17]. A speech recognition system involves statistical measures: conditional probabilities from the model are used to form posterior probabilities. The HMM normalizes the speech signal with respect to time, which helps characterize the conditional probabilities. This paper designs an HMM-based DNN system for continuous Telugu speech using a context-independent (CI) word-model approach, in which every word is mapped accordingly. Repeated words have different pronunciations, so each word is indexed properly for further processing. Classification in deep neural networks is done in two phases, training and testing; 40% of the collected data is used for training and 60% for testing. ANN-based HMMs have been used for such designs, but the recent approach is the DNN-based HMM, which improves recognition accuracy by solving LVCSR problems [18].
The DNN uses a deeper, more expressive neural network framework trained with the DBN (deep belief network) algorithm, which makes better use of the training data. The DNN also produces posterior probabilities matched to the word-based approach. The DNN-based HMM is a hybrid architecture over senones and is well suited to continuous speech; training on senones allows more data to be represented in the network. The detailed HMM-based DNN is shown in figure 5. An LVCSR system is considered to build the word-based architecture.
Fig. 5: HMM based DNN architecture
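One standard mechanism in such hybrid systems (an assumption here, not a detail stated by the authors) is to convert the DNN's senone posteriors p(s|x) into scaled acoustic likelihoods for the HMM decoder by dividing out the senone priors, i.e. log p(x|s) = log p(s|x) - log p(s) + const. A toy NumPy sketch with illustrative numbers:

```python
import numpy as np

def posteriors_to_scaled_loglik(log_posteriors, log_priors):
    """Hybrid DNN-HMM trick: log p(x|s) = log p(s|x) - log p(s) + const,
    so DNN outputs can serve the decoder as scaled acoustic log-likelihoods."""
    return log_posteriors - log_priors

# Toy example: 4 senones over 3 frames (illustrative numbers only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
# log-softmax turns raw DNN logits into log posteriors per frame
log_post = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
priors = np.array([0.4, 0.3, 0.2, 0.1])   # senone priors from alignment counts
scaled = posteriors_to_scaled_loglik(log_post, np.log(priors))
```

The constant log p(x) is the same for every senone in a frame, so it does not affect the decoder's argmax and can be dropped.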
3. Proposed Algorithm
An algorithm is proposed that combines all the techniques described above; the flow of the process is shown in figure 6. This overall process is used by many researchers for recognition; only the individual techniques vary.
Fig. 6: Proposed algorithm
4. Results and Discussion
Continuous speech is recorded and a database is created. The continuous speech consists of 600 sentences containing 446 uncommon words. The database is recorded in a noise-free environment. Around 24 hours of continuous speech is read by 3 male and 3 female speakers for the training database, and a testing database is created from 7 male and 7 female speakers; overall, 20 speakers contribute to the training and testing databases. The data is preprocessed by de-noising, framing, filtering, amplification and windowing as discussed before; the detailed preprocessing parameters are shown in table 1.
Table 1: Signal preprocessing parameters

Signal feature               Design value
Recorded format              .wav
Sampling rate                16,000 Hz
Bit rate                     64 bits/sec and 128 bits/sec
Mode (mono or stereo)        Mono
Hamming window length        0.01285 s
FFT points                   256
Number of filters used       30
Lowest frequency recorded    180 Hz
Highest frequency recorded   3650 Hz
Number of cepstral coefficients  12
Feature extraction is the next part: the individual MFCC and DWPD features and the joint MFCC and DWPD features are extracted in MATLAB, as shown in figure 7. Comparatively, the joint MFCC and DWPD features prove more than 90% quality, better than the individual features. Features are extracted at 64 bits/sec and 128 bits/sec, as shown in table 2; LSP, PPF, CBI, gain, synchronization and FEC are the features extracted.
Fig. 7: Joint Feature extraction by MFCC and DWPD
Table 2: Various features extracted

Feature                         Bits at 64 bits/sec   Bits at 128 bits/sec
Line spectrum pairs (LSP)       11                    24
Pitch prediction filter (PPF)   15                    36
Code base indexes (CBI)         18                    21
Gain                            18                    44
Synchronization                 1                     1
FEC                             1                     2
Total                           64                    128
The extracted features are classified using the HMM-based DNN. The main parameters to consider in the word-based approach are the word error rate (WER) and word accuracy. The continuous speech is tested, and the results in table 3 show quite impressive recognition accuracy.
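WER is computed by aligning the reference and hypothesis word sequences with edit-distance dynamic programming, counting substitutions, insertions and deletions as in table 3: WER = (S + D + I) / N. A minimal sketch, using hypothetical word lists:

```python
def word_error_rate(ref, hyp):
    """WER = (S + D + I) / N via edit-distance dynamic programming."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # all deletions
    for j in range(m + 1):
        d[0][j] = j          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

ref = "nenu telugu nerchukuntunnanu".split()
hyp = "nenu telugu nerchukunnanu".split()
wer = word_error_rate(ref, hyp)   # 1 substitution over 3 words -> 1/3
```

Word accuracy is then reported as 1 minus WER (or as a correctness rate that ignores insertions, depending on the scoring convention).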
Table 3: Recognition accuracy parameters

Parameter              Value
Words spoken           446
Word errors            82 (Sub: 52, Ins: 5, Del: 25)
Recognition accuracy   91.89 %
Sentences recorded     600
Time                   Audio: 72.86 sec; Processing: 185.92 sec
Speed                  2.5 × real time
Memory                 Average: 25.68 MB; Max: 30 MB
5. Conclusion
The speech recognition task has been addressed through approaches and techniques suited to generating high accuracy, with proper results and discussion. The speech is recorded and preprocessed; the joint MFCC and DWPD features are extracted, and the resulting vector sets are used for classification. The HMM-based DNN technique with a word-based approach achieves high efficiency with very low word error rates, demonstrating the quality of this Telugu speech recognition system both qualitatively and technically. In future work, researchers can replace the word-based approach with a triphone-based approach and design systems with still greater accuracy.
Acknowledgment
The authors first thank Almighty God for the opportunity to carry out research in speech recognition, especially in the Telugu language. The authors thank Amity University Rajasthan for providing a dedicated laboratory for speech processing and recognition, and Dr. Sonia Sunny for guidance on feature extraction and de-noising. Special thanks to the colleagues of the ECE department, Amity University Rajasthan, and, last but not least, to all family members, friends and foes.
References
[1] Ramesh Babu P., Digital Signal Processing, Fourth edition, SciTech Publications (2003).
[2] Yadav A.K., Roy R., Kumar R., Kumar C.S., Kumar A.P., Algorithm for De-Noising of Color Images based on Median Filter, Third International Conference on Image Information Processing (ICIIP) (2015), 428-432.
[3] Kumar A.P., Kumar N., Kumar C.S., Yadav A.K., Sharma A., Speech Recognition using Arithmetic Coding and MFCC for Telugu Language, 3rd International Conference on Computing for Sustainable Global Development (2016), 265-268.
[4] Sunny S., Peter D., Jacob K., A comparative study of wavelet based feature extraction techniques in recognizing isolated spoken words, International Journal of Signal Processing Systems 7(10) (2013).
[5] Sunny S., Peter D., Jacob K., Combined feature extraction techniques and Naive Bayes classifier for speech, Computer Science and Information Technology, Academy & Industry Research Collaboration Centre 3 (2013).
[6] Kumar A.P., Kumar N., Kumar C.S., Yadav A.K., Speech compression by adaptive Huffman coding using Vitter algorithm, International Journal of Innovative Sciences 2(5) (2015).
[7] Sunny S., Peter D., Jacob K., Performance of different classifiers in speech recognition, International Journal of Research and Engineering Technology 2(4) (2013).
[8] Rajesh M., Hema A., Rao V.R., Continuous speech recognition using joint features derived from modified group delay function and MFCC (2012).
[9] Kumar A.P., Roy R., Rawat S., Sharma A., Telugu speech feature extraction by MODGDF and MFCC using Naive Bayes classifier, International Journal of Control Theory and Applications 9(21) (2016).
[10] Thangarajan R., Natarajan A.M., Selvam M., Word and triphone based approaches in continuous speech recognition for Tamil language, WSEAS Transactions on Signal Processing 4(3) (2008), 76-86.
[11] Venkateswarlu R.L.K., Ravi Teja R., Vasantha Kumari R., Developing efficient speech recognition system for Telugu letter recognition, International Conference on Computing, Communication and Applications (ICCCA) (2012), 1-6.
[12] Gotoh Y., Hochberg M.M., Mashao D.J., Silverman H.F., Incremental MAP Estimation of HMMs for Efficient Training and Improved Performance, International Conference on Acoustics, Speech, and Signal Processing 1 (1995), 457-460.
[13] Kalyani N., Sunitha K.V.N., Syllable analysis to build a dictation system in Telugu language, International Journal of Computer Science and Information Technology 6(3) (2009).
[14] Dalmiya C.P., Dharun V.S., Rajesh K.P., An efficient method for Tamil speech recognition using MFCC and DTW for mobile applications, IEEE Conference on Information and Communication Technologies (2013).
[15] Singh S., Rajan E.G., Vector Quantization Approach for Speaker Recognition using MFCC and Inverted MFCC, International Journal of Computer Applications 17(1) (2011).
[16] Hossan M.A., Memon S., Gregory M.A., A Novel Approach for MFCC Feature Extraction, 4th International Conference on Signal Processing and Communication Systems (ICSPCS) (2010), 1-5.
[17] Hornik K., Stinchcombe M., White H., Multilayer Feedforward Networks are Universal Approximators, Neural Networks 2(5) (1989), 359-366.
[18] Dahl G.E., Yu D., Deng L., Acero A., Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing 20(1) (2012), 30-42.