
Vol. 6, No. 4 April 2015 ISSN 2079-8407 Journal of Emerging Trends in Computing and Information Sciences

©2009-2015 CIS Journal. All rights reserved.

http://www.cisjournal.org


Nasalized Vowel Detection Based on a Threshold of an Acoustic Parameter

1 Shamima Najnin, 2 Celia Shahnaz

1 Department of Electrical and Computer Engineering, Herff College of Engineering, University of Memphis, Tennessee, USA

2 Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh

1 [email protected], 2 [email protected]

ABSTRACT

In this paper, a threshold-based approach using an acoustic parameter derived from the modified group delay function is presented for detecting nasalized vowels in a mixture of oral and nasalized vowels of normal speakers. Acoustic analysis of the modified group delay spectrum shows that nasalization introduces additional formants (resonances) at various frequency locations. It is also found that a new formant in the low-frequency region around 250 Hz appears consistently for both female and male speakers. By verifying this fact on the band-limited modified group delay spectrum, which is capable of resolving two closely spaced formants, an acoustic parameter, RMGD, is formed. Using RMGD, the detection of nasalized and oral vowels is formulated as a two-class problem and solved with a threshold-based scheme. Simulation results on the TIMIT database show that the proposed method detects nasalized vowels in a mixture of oral and nasalized vowels with 92% average accuracy.

Keywords: Formant, Oral vowel, Nasalized vowel, Group delay function, Acoustic analysis

1. INTRODUCTION

In the event of nasalization, the velum drops to allow coupling between the oral and nasal cavities. When this happens, the oral cavity is still the major source of output, but the sound acquires distinctively nasal characteristics. Over 99% of languages contain nasalized vowels or consonants [1]. Because the vocal tract is excited by vocal fold vibration during the production of a nasal consonant, the consonant is considered to be voiced. When nasal consonants are produced, air flows through the nasal tract and is radiated at the nostrils. The closed oral cavity and the sinuses of the nose form shunting cavities to the main path and substantially influence the resulting radiated sound.

Nasalized vowels are pronounced in a manner similar to nasal consonants, with the exception being that the oral cavity is not blocked, thereby allowing air to flow through both the nasal and oral cavities. In many languages, including American English, nasal consonants can have a profound effect on neighboring vowels.

Following the release of a nasal consonant, the initial portion of a following vowel will be nasalized during the time interval that the velum is closing. The same holds true for the final portion of a vowel preceding a nasal consonant. The amount of co-articulated nasalization depends upon the particular language and dialect. Coarticulatory nasalization of the vowel preceding a nasal consonant is a regular phenomenon in all languages of the world. The coarticulation can, however, be so large that the nasal murmur (the sound produced with a complete closure at a point in the oral cavity, and with an appreciable amount of coupling of the nasal passages to the vocal tract) is completely deleted and the cue for the nasal consonant is only present as nasalization in the preceding vowel. This is especially true for spontaneous speech. Since anticipatory nasalization is common in American English, a sequence of a vowel plus a nasal consonant (VN) may, in many situations, be pronounced as a simple nasalized vowel, or a nasalized vowel plus a short, residual nasal murmur [2].

Though humans are sensitive to nasalized sounds, automated speech recognizers perform poorly when it comes to nasalized vowels [3]. A vowel nasalization detector is also essential for speech recognition in languages with phonemic nasalization (i.e., languages with minimal pairs of words that differ in meaning only through a change in the nasalization of the vowel), and it is therefore considered an important part of a landmark-based speech recognition system. Further, it is suggested in [3] that detection of vowel nasalization is important to give the pronunciation model the ability to learn that a nasalized vowel is a high-probability substitute for a nasal consonant. Note that nasalization of the vowel might be the only feature distinguishing “cat” from “can’t”. Hence, the automatic detection of vowel nasalization is an important and challenging problem.

Research has been reported on many aspects of vowel nasalization in terms of acoustics, perception, and physiology [4], [5]. In [6], it is observed that oral-nasal coupling introduces an additional pole-zero pair for nasalized vowels. It is shown in [2], [7] that the main features of nasalization are changes in the low-frequency regions of the speech spectrum. Apart from the introduction of a new pole-zero pair in the first formant region, nasalization also gives rise to changes in the spectrum in the high-frequency region [7]. However, the changes in the high-frequency region are less consistent across speakers and vowels than the changes in the low-frequency region in the vicinity of the first formant. It is found in [8] that the difference between the amplitude of the first formant and the amplitude of the extra peak near 1 kHz can be used to measure nasalization and its degree. It should be mentioned that results for the detection of nasalized vowels have not been reported in [8]. In [9], the nasalized vowel is detected using the sensitivity of the Teager energy operator: a measurable difference between the low-pass and band-pass profiles of the Teager energy operator is observed for nasalized vowels, whereas no difference is found for the normal vowel, which is a single-component signal.

The most popular analysis method for automatic speech recognition uses mel-frequency cepstral coefficients (MFCCs), which represent speech spectra while incorporating some aspects of audition. In particular, MFCC parameters have been used in methods for the automatic detection of vowel nasality [10], [12]. In [12], MFCCs and an SVM are used to study the overall patterns of nasality in large sets of vowel tokens, whereas in [10], the standard MFCC coefficients are extracted at the centre of the vowel and an SVM classifier is built to discriminate between oral and nasalized vowels in a vowel-independent manner. Although formants provide an important acoustic cue for nasalized/oral vowel detection [13], no simple relationship exists between MFCCs and formants. For example, for speech with four formants, a high second MFCC suggests high energy in the first and third formants and low energy in the second and fourth formants, but such a relationship is only approximate when the formants deviate from their average positions. This may affect nasalized/oral vowel detection.
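For readers unfamiliar with the operator used in [9], the following is a minimal sketch of the discrete Teager energy operator in its commonly used three-sample form. It is not the detection scheme of [9], which compares low-pass and band-pass energy profiles; the function name is hypothetical. It only illustrates why a single-component signal behaves simply under the operator: for a pure sinusoid the output is constant.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns len(x) - 2 values (the two end samples have no neighbours).
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Single-component signal: A*cos(W*n + phi) gives the constant A^2 * sin(W)^2.
n = np.arange(200)
single = 2.0 * np.cos(0.3 * n + 0.1)
psi_single = teager_energy(single)

# Two-component signal: cross terms make the output fluctuate over time.
double = np.cos(0.3 * n) + np.cos(0.9 * n)
psi_double = teager_energy(double)
```

The constant output for one component and the fluctuating output for two components is the property that makes the operator sensitive to an added nasal resonance.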

In order to overcome the problems of the existing methods, the goal of our current work is to detect nasalized and oral vowels of normal speakers based on an acoustic parameter, RMGD, derived from the modified group delay spectrum. In developing the detection method, prior knowledge about the acoustics of nasalized vowels is exploited. A detailed acoustic analysis is carried out in the vicinity of the first formant frequency for the vowels /a/, /i/, and /u/ of a large number of normal male and female speakers. The acoustic parameter RMGD is fed to a threshold-based scheme for the detection of nasalized/oral vowels from the mixture of oral and nasalized vowels of normal speakers.

This paper is organized as follows. In Section 2, the modified group delay function is defined and shown to be effective for the acoustic analysis of a real speech signal with two closely spaced formants. The proposed method of detecting nasalized/oral vowels, which uses the modified group delay based acoustic parameter derived from the real speech signal, is also described in that section. In Section 3, a detailed simulation is carried out using the standard TIMIT speech database. The contribution of the proposed method is highlighted with some concluding remarks in Section 4.

2. PROPOSED METHOD

2.1 Modified Group Delay Function

For a frame of speech signal x(n), n = 0, 1, ..., N-1, the Fourier transform is given by

$$X(k) = |X(k)|\, e^{j\theta(k)}. \qquad (1)$$

Here, θ(k) represents the phase spectrum of x(n).

The negative derivative of θ(k) is defined as the group delay, which can be written as

$$\tau(k) = -\frac{\partial \theta(k)}{\partial k}. \qquad (2)$$

Simplifying equation (2), we obtain from [14] that

$$\tau(k) = -\mathrm{Im}\!\left[\frac{\partial \log X(k)}{\partial k}\right] = \frac{X_R(k)\,Y_R(k) + X_I(k)\,Y_I(k)}{|X(k)|^{2}}, \qquad (3)$$

for k = 0, 1, 2, ..., N-1. In equation (3), X(k) and Y(k) are the N-point Fast Fourier Transforms (FFTs) of the sequences x(n) and nx(n), with subscripts R and I denoting the real and imaginary parts, respectively. In Figs. 1(a), (b), and (c), a vowel sound /i/, its power spectrum, and its group delay function are plotted. As seen from these figures, the group delay spectrum contains meaningless peaks and valleys. To reduce this spiky nature, which arises from pitch peaks, noise, and windowing effects, equation (3) is modified by replacing the power spectrum |X(k)|^2 with its cepstrally smoothed counterpart S(k). The resulting modified group delay function is given by [15]

$$\tau_p(k) = \mathrm{sign}\cdot\left|\frac{X_R(k)\,Y_R(k) + X_I(k)\,Y_I(k)}{S(k)^{2\gamma}}\right|^{\alpha}. \qquad (4)$$
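Equations (3) and (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the FFT length `n_fft`, the lifter length `n_lifter`, and the small constant guarding against division by zero are all hypothetical choices, and S(k) is obtained here by simple low-time liftering of the real cepstrum. The values α = 0.6 and γ = 0.9 are the ones reported later in the paper.

```python
import numpy as np

def _spectra(x, n_fft):
    """FFTs of x(n) and n*x(n), as required by equation (3)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    return np.fft.rfft(x, n_fft), np.fft.rfft(n * x, n_fft)

def group_delay(x, n_fft=512):
    """Group delay of one frame, equation (3)."""
    X, Y = _spectra(x, n_fft)
    num = X.real * Y.real + X.imag * Y.imag
    return num / (np.abs(X) ** 2 + 1e-12)   # 1e-12: assumed guard value

def cepstral_smooth(mag, n_lifter=30):
    """Smooth a one-sided magnitude spectrum by low-time liftering."""
    cep = np.fft.irfft(np.log(mag + 1e-12))  # real cepstrum
    lifter = np.zeros_like(cep)
    lifter[:n_lifter] = 1.0                  # keep low quefrencies
    if n_lifter > 1:
        lifter[-(n_lifter - 1):] = 1.0       # and their symmetric mirror
    return np.exp(np.fft.rfft(cep * lifter).real)

def modified_group_delay(x, alpha=0.6, gamma=0.9, n_fft=512, n_lifter=30):
    """Modified group delay of one frame, equation (4)."""
    X, Y = _spectra(x, n_fft)
    S = cepstral_smooth(np.abs(X), n_lifter)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma))
    # S > 0, so sign(tau) equals the sign of the group delay in equation (3).
    return np.sign(tau) * np.abs(tau) ** alpha

# Sanity check: a unit impulse delayed by 5 samples has group delay 5 everywhere.
frame = np.zeros(64)
frame[5] = 1.0
gd = group_delay(frame, n_fft=128)
mgd = modified_group_delay(frame, n_fft=128)
```

For the delayed impulse the spectrum is flat, so the smoothed spectrum is one and the modified group delay reduces to 5 raised to the power α.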

In equation (4), sign denotes the sign of the original group delay function given in (3). Assuming that the speech spectrum is the product of the vocal tract spectrum V(k) and the excitation spectrum U(k), the group delay function of the speech x(n) can be written as

$$\tau_x(k) = -\frac{\partial\, \arg\big(V(k)\,U(k)\big)}{\partial k} = \tau_V(k) + \tau_U(k), \qquad (5)$$

where τ_V(k) represents the modified group delay function of the vocal tract and τ_U(k) that of the excitation. Because of this additive property, τ_V(k) and τ_U(k) are distinguishable and easily separable in the modified group delay domain. Moreover, the frequency resolution of the speech spectrum improves in the modified group delay domain compared to that in the magnitude spectrum.
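For reference, the step from equation (2) to equation (3) and the additive property in equation (5) can be spelled out as follows. This is a sketch that treats the frequency index as a continuous variable ω, takes arg(·) as the unwrapped phase, and is written for the unmodified group delay; the paper applies the same additive argument in the modified group delay domain.

```latex
\begin{align*}
\log X(k) &= \log|X(k)| + j\,\theta(k)
  \;\Longrightarrow\; \theta(k) = \mathrm{Im}\big[\log X(k)\big], \\
\tau(k) &= -\frac{\partial \theta(k)}{\partial k}
  = -\mathrm{Im}\!\left[\frac{1}{X(k)}\,\frac{\partial X(k)}{\partial k}\right], \\
\frac{\partial X(\omega)}{\partial \omega}
  &= \frac{\partial}{\partial \omega}\sum_{n=0}^{N-1} x(n)\,e^{-j\omega n}
  = -j\sum_{n=0}^{N-1} n\,x(n)\,e^{-j\omega n} = -j\,Y(\omega), \\
\tau(k) &= -\mathrm{Im}\!\left[\frac{-j\,Y(k)}{X(k)}\right]
  = \mathrm{Re}\!\left[\frac{Y(k)\,X^{*}(k)}{|X(k)|^{2}}\right]
  = \frac{X_R(k)\,Y_R(k) + X_I(k)\,Y_I(k)}{|X(k)|^{2}}, \\
\arg\big(V(k)\,U(k)\big) &= \arg V(k) + \arg U(k)
  \;\Longrightarrow\; \tau_x(k) = \tau_V(k) + \tau_U(k).
\end{align*}
```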


Fig. 1: (a) Vowel sound /i/; (b) power spectrum of the vowel /i/; (c) group delay spectrum of the vowel /i/.

Exploiting these facts, the modified group delay function is expected to be suitable for providing the vocal tract information required for speech/vowel detection.

2.2 Acoustic Analysis on Real Speech Signal

The formant frequencies of oral vowels are taken as the reference for the acoustic analysis of nasalized vowels [13]. The transition from an oral vowel to a nasal consonant, or vice versa, may involve an intermediate nasalized vowel sound, as the velum lowers before or rises after oral closure [13]. Acoustic analysis shows that there is an extra formant around 250 Hz for the nasalized /a/ vowel and around 1000 Hz for the nasalized /i/ and /u/ vowels [16]. Let F1, F2, and F3 be the first three formant frequencies of the vowels /a/, /i/, and /u/. As there is a formant around 250 Hz for all the vowels, F1 and F2 are closely spaced formants. The difficulty is then to extract F1 and F2 from an appropriate spectrum of the real speech signal. In order to validate the effectiveness of the modified group delay function in extracting closely spaced F1 and F2, the conventional magnitude spectrum using the Discrete Fourier Transform (DFT), the cepstrally smoothed DFT spectrum, Linear Prediction (LP) spectra of different orders, and the proposed modified group delay spectrum are obtained from a real speech signal and plotted in Fig. 2. It can be seen from Fig. 2 that the DFT spectrum cannot clearly exhibit F1 and F2 because of the presence of pitch harmonics. Even cepstral smoothing of the DFT spectrum cannot distinguish the closely spaced F1 and F2. The LP spectrum with increased order also fails to represent them. Moreover, the spectrum computed from the modified group delay function of the full-band signal cannot resolve the closely spaced F1 and F2 either, because of the influence of adjacent poles. The influence of other poles on the value of the modified group delay function of a given pole p may be written as

$$\tau_p^{i}(k) = \sum_{r=1,\; r \neq p}^{P} \tau_r(k). \qquad (6)$$

In equation (6), the influence of zeros is not taken into account, and τ_p(k) represents the value of the group delay function at the angular frequency indexed by k due to the p-th pole. Since the amount of influence of other poles on the group delay function of a given pole is proportional to the number of poles (or formants) of a speech signal, reducing the number of formants reduces the influence of the adjacent poles. The influence of the less important higher formants on the low-frequency region of the speech spectrum, which is the band of interest, can be avoided by band-limiting the speech signal via low-pass filtering. This operation is expected to yield more distinct peaks in the modified group delay spectrum, a feature that is useful for extracting closely spaced formants by overcoming the influence of the adjacent poles. In order to verify the effect of band-limiting on the modified group delay spectrum, the real speech signal with closely spaced formants considered in the plots of Fig. 2 is filtered with a low-pass finite impulse response filter with a cutoff frequency of 800 Hz. The modified group delay spectrum computed for this band-limited real speech signal is plotted in Fig. 3. Comparing the modified group delay spectra computed prior to and after low-pass filtering, as seen in Figs. 3(c) and 3(d) respectively, it is evident that the band-limiting operation helps resolve the two closely spaced formants very distinctly. Hence, the modified group delay spectrum computed from a band-limited speech signal is an effective means of extracting two closely spaced formants, as occur in the case of a nasalized vowel. Here, the speech signal is passed through a low-pass filter with a cutoff frequency of 800 Hz (the first formant of /a/) in order to accommodate only the lower formants, including the first formant of all the phonemes considered.

Fig. 2: (a) Real speech signal (vowel /i/); (b) DFT magnitude spectrum; (c) cepstrally smoothed DFT spectrum; (d) LP spectrum (order 8); (e) LP spectrum (order 12); (f) modified group delay spectrum.

Fig. 3: (a) Low-pass filtered real speech signal; (b) DFT magnitude spectrum of the signal; (c) modified group delay spectrum prior to low-pass filtering; (d) modified group delay spectrum of the low-pass filtered speech signal.

The modified group delay spectra of oral and nasalized /a/, /i/, and /u/ phonemes are shown in Fig. 4(a) through Fig. 4(f). It is clear from these figures that an extra formant is introduced around 250 Hz for all nasalized /a/, /i/, and /u/. In a band-limited spectrum below 800 Hz, we expect one (for /a/) or two (for /i/, /u/) formant peaks for normal oral vowels and two formant peaks (one oral and one nasal) for nasalized vowels. Any remaining peaks in this frequency region must correspond to the pitch harmonics in oral and nasal vowels. Using the band-limited modified group delay spectra, two formant frequencies are extracted in the low-frequency region (below 800 Hz) for both oral and nasal phonemes /a/, /i/, and /u/. Of these two, the lower frequency is termed F1 and the higher frequency F2. Here, F1 is a nasal formant that corresponds to a strong peak for a nasal phoneme, as seen in Figs. 4(b), (d), and (f), and to a relatively smaller peak for normal oral phonemes, as portrayed in Figs. 4(a), (c), and (e). In the plots of Fig. 4, F2 is the oral formant corresponding to the phonemes /a/, /i/, and /u/ in both the normal and nasalized cases. The features of F1 and F2 corresponding to phonemes of the nasalized vowel are as follows:

• For the nasalized phoneme /a/, the nasal formant F1 varies from 200 Hz to 350 Hz. The oral formant F2 for /a/ varies between speakers and is distributed between 500 Hz and 800 Hz.

• For nasalized /i/ and /u/, the nasal formants F1 are distributed between 200 Hz and 325 Hz. Due to the introduction of the nasal formant, the oral formants (F2) are observed around 350 Hz to 500 Hz.
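The band-limiting and peak extraction steps described in this subsection can be sketched as follows. The paper specifies only a low-pass FIR filter with an 800 Hz cutoff; the windowed-sinc design, the filter length `n_taps`, and in particular the rule of taking the two strongest local maxima below 800 Hz as F1 and F2 are assumptions made for illustration, and both function names are hypothetical.

```python
import numpy as np

def lowpass_fir(x, fs, cutoff_hz=800.0, n_taps=101):
    """Band-limit x with a Hamming-windowed sinc FIR filter (assumed design)."""
    m = np.arange(n_taps) - (n_taps - 1) / 2.0
    fc = cutoff_hz / fs                          # cutoff in cycles per sample
    h = 2.0 * fc * np.sinc(2.0 * fc * m) * np.hamming(n_taps)
    h /= h.sum()                                 # unity gain at DC
    return np.convolve(x, h, mode="same")

def two_strongest_peaks(spectrum, freqs, f_max=800.0):
    """Return (F1, value at F1, F2, value at F2) with F1 < F2.

    Assumption: F1 and F2 are the two largest local maxima below f_max.
    """
    band = np.where(freqs <= f_max)[0]
    s, fb = spectrum[band], freqs[band]
    # Interior samples that exceed the left neighbour and are not below the right.
    idx = np.where((s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]))[0] + 1
    if len(idx) < 2:
        raise ValueError("fewer than two peaks found below f_max")
    lo, hi = np.sort(idx[np.argsort(s[idx])[-2:]])
    return fb[lo], s[lo], fb[hi], s[hi]

# Filter check: a 200 Hz tone passes, a 3000 Hz tone is suppressed.
fs = 16000
t = np.arange(1600) / fs
y = lowpass_fir(np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t), fs)
Ymag = np.abs(np.fft.rfft(y))                    # 10 Hz per bin

# Peak check on a made-up spectrum with bumps at 250 Hz and 500 Hz.
freqs = np.linspace(0.0, 8000.0, 257)
toy = 3.0 * np.exp(-((freqs - 250.0) / 40.0) ** 2) \
    + 1.0 * np.exp(-((freqs - 500.0) / 40.0) ** 2)
f1, v1, f2, v2 = two_strongest_peaks(toy, freqs)
```

In practice `spectrum` would be a modified group delay spectrum of the band-limited frame and `freqs` the matching frequency axis, for example from `np.fft.rfftfreq`.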

2.3 Detection of Nasalized and Oral Vowels

It is observed that the nasal formant introduced in nasalized vowels and the pitch harmonics of normal oral vowels lie in the same frequency range. Also, the oral formants of some normal oral vowels appear in the corresponding pitch harmonics range. Hence, identifying formant locations is not a reliable approach for the discrimination of nasal/oral vowels. As noted earlier, for normal oral vowels the modified group delay value at the locations of pitch harmonics is relatively small compared to the modified group delay value at the locations of oral formants. On the contrary, for nasalized vowels, the modified group delay values at the locations of nasal formants are found to be stronger than those at the locations of oral formants. This observation, obtained from Fig. 4, can be exploited for the detection of oral/nasalized vowels. For this purpose, based on the modified group delay values at the locations of F1 and F2 as seen in Fig. 4, a new acoustic parameter, the ratio of modified group delay (RMGD) values, is defined as

$$\mathrm{RMGD} = \frac{\text{Modified group delay at } F_1}{\text{Modified group delay at } F_2} = \frac{\tau(F_1)}{\tau(F_2)}. \qquad (7)$$

Fig. 4: Modified group delay spectra of: (a) oral /a/ and (b) nasalized /a/; (c) oral /i/ and (d) nasalized /i/; (e) oral /u/ and (f) nasalized /u/.

Since the numerator in equation (7) is larger than the denominator for nasalized vowels, RMGD takes a larger value. For normal oral vowels, RMGD takes a relatively lower value, as the value at F1 is generally smaller than the value at F2. The RMGD values computed for the nasalized and oral vowels (/a/, /i/, /u/) of both male and female speakers are plotted in Fig. 5. From this figure, the following observations can be made:

• For the nasalized phoneme /a/, the RMGD is greater than one and varies in the range 1 to 16. For all the normal oral /a/ vowels, the RMGD is always less than one.

• For the nasalized phoneme /i/, the RMGD takes values greater than those obtained for nasalized /a/, sometimes with a wider range of 1 to 25. For the normal oral /i/ vowels, the range is 0.5 to 3.5.

• Similarly, for the nasalized phoneme /u/, the RMGD ranges from 1 to 38, whereas it varies from 0.2 to 2.1 for normal oral /u/ vowels.

The distribution of RMGD for all the vowels indicates that, since there is no significant overlap between the RMGD values of nasalized and oral vowels, a suitable fixed threshold can be set to discriminate between them. Defining the threshold of RMGD as ε, the test vowel is declared nasalized if RMGD is greater than ε; otherwise it is detected as an oral vowel. Thus, RMGD can serve as a potential feature for the detection of nasal/oral vowels.

2.4 Threshold Based Detection

In the threshold based detection using a mixture of nasalized and oral phonemes /a/, /i/, or /u/, the analysis is performed assuming that the class of the incoming phoneme is unknown. The average accuracy over both nasalized and oral phonemes of a particular class of mixture can be obtained for different values of the threshold ε. However, if the class of the incoming phoneme is known, the threshold for which the average accuracy (%) over both nasalized and oral phonemes of that class reaches a maximum can be determined, and nasalized/oral detection can be performed with that threshold. The threshold that maximizes this accuracy differs between phoneme classes.
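The decision rule and the threshold selection described above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the candidate thresholds are the values of ε listed in Table 1, and the RMGD values and labels in the example are made up, not data from the paper.

```python
import numpy as np

def rmgd(tau_f1, tau_f2):
    """Equation (7): modified group delay at F1 divided by that at F2."""
    return tau_f1 / tau_f2

def is_nasalized(rmgd_values, eps):
    """Decision rule: nasalized where RMGD > eps, oral otherwise."""
    return np.asarray(rmgd_values, dtype=float) > eps

def accuracy_percent(rmgd_values, nasal_labels, eps):
    """Percentage of vowels whose decision matches the true label."""
    pred = is_nasalized(rmgd_values, eps)
    return 100.0 * np.mean(pred == np.asarray(nasal_labels, dtype=bool))

def best_threshold(rmgd_values, nasal_labels,
                   candidates=(1.0, 1.5, 2.0, 2.5, 3.0)):
    """Pick the candidate eps with the highest accuracy (first one on ties)."""
    accs = [accuracy_percent(rmgd_values, nasal_labels, e) for e in candidates]
    i = int(np.argmax(accs))
    return candidates[i], accs[i]

# Made-up example: three oral and three nasalized tokens of one phoneme class.
values = [0.4, 0.8, 1.2, 5.0, 9.0, 2.2]
labels = [False, False, False, True, True, True]
eps_star, acc_star = best_threshold(values, labels)
```

Running `best_threshold` separately on each phoneme class corresponds to the known-class case in the text, while running it on the pooled tokens corresponds to the unknown-class case.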

Fig. 5: The distribution of RMGD of nasalized and oral vowels (a) for the phoneme /a/, (b) for the phoneme /i/, and (c) for the phoneme /u/.

3. RESULTS

In this section, a number of simulations are carried out to evaluate the performance of the proposed method.

3.1 Database and Simulation Conditions

In the simulations, we employed the TIMIT database [17], which contains a total of 6300 sentences, with 10 sentences spoken by each of 630 speakers (438 males, 192 females) from eight major dialect regions of the United States. The speech files are sampled at 16 kHz. While dividing the speech data uttered by 50 normal speakers into training and testing sets, vowels are considered to be oral (non-nasalized) when they are not in the context of nasal consonants. It is assumed that all vowels preceding nasal consonants such as /m/, /n/, /ng/, and /nx/ are nasalized [11], [18]. In the case of vowels following nasal consonants, nasalization might not be very strong [10], and such cases are not considered in this paper. The phonemes /a/, /i/, and /u/ are considered for the analysis and detection of oral/nasalized vowels. For oral vowels, speech data in the vicinity of the first formant frequency of the phonemes /a/, /i/, and /u/, with obstruent consonants on both sides of the vowels, are collected from the normal speakers for comparative study. For the acoustic analysis of the nasalized vowels, speech data with the nasal consonant on the preceding side of the vowels, collected from the normal speakers, are utilized. To derive a smoothed modified group delay spectrum, the values of α and γ in equation (4) are chosen to be less than one, namely α = 0.6 and γ = 0.9.

The RMGD is computed for the phonemes /a/, /i/, and /u/ for both nasalized and oral vowels collected from the 50 normal speakers mentioned before. As the performance metric for evaluating the proposed and other methods, the percent accuracy is considered, which is defined as

$$\%\,\mathrm{Accuracy} = \frac{\text{Number of correctly detected phonemes}}{\text{Total number of oral and nasalized phonemes}} \times 100.$$

3.2 Simulation Results

The average accuracy over both nasalized and oral phonemes of a particular class of mixture for nasalized/oral vowel detection is shown in Table 1 for different values of the threshold ε. For each threshold ε, Table 1 also summarizes the average accuracy (%) over all the phonemes (/a/, /i/, and /u/). This overall average reaches its maximum of 92% at ε = 1.0. The detection accuracy for nasal/oral phoneme /a/ reaches a maximum of 98%, whereas for nasal/oral phonemes /i/ and /u/ the maximum accuracy is 94% and 90%, respectively.

4. CONCLUSION

In this paper, a threshold based scheme is used for the automatic detection of oral/nasal vowels based on an acoustic parameter, RMGD, that is developed from the modified group delay function. It is shown that, owing to the additive property of the modified group delay functions of the excitation and the vocal tract, the modified group delay spectrum computed from a band-limited speech signal is capable of resolving two closely spaced formants, as occur in a nasalized vowel. Instead of depending only on the formant locations in the modified group delay spectrum, the ratio of the modified group delay values at the two consecutive formant locations is exploited to derive the acoustic parameter RMGD. Since RMGD exhibits larger values for nasalized vowels and relatively lower values for normal oral vowels, it is found to be an effective feature for detecting nasalized/oral vowels using a threshold based classifier. A detailed simulation is carried out using both nasalized and oral vowels collected from the TIMIT database. It is shown that the proposed method performs effectively, with 92% average accuracy.
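The 92% figure is the average over the three phoneme classes at ε = 1 in Table 1. As a quick cross-check, the averages and the best threshold can be recomputed from the per-phoneme accuracies reported there; the variable names below are illustrative only.

```python
# Per-phoneme accuracies (%) from Table 1, one entry per threshold.
thresholds = [1.0, 1.5, 2.0, 2.5, 3.0]
table1 = {
    "/a/": [98, 88, 86, 86, 76],
    "/i/": [88, 94, 94, 92, 84],
    "/u/": [90, 90, 78, 66, 64],
}

# Average over the three phoneme classes for each threshold.
averages = [round(sum(col) / len(col), 2) for col in zip(*table1.values())]

# Threshold giving the highest average accuracy.
best_eps = thresholds[averages.index(max(averages))]

# Best accuracy reached by each phoneme class over all thresholds.
per_phoneme_max = {name: max(row) for name, row in table1.items()}
```

The recomputed values agree with the "Average" row of Table 1 and with the per-phoneme maxima quoted in Section 3.2.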

Table 1: Average accuracy (%) for the detection of nasalized/oral vowels using different values of the threshold ε

Phoneme    ε = 1    ε = 1.5    ε = 2    ε = 2.5    ε = 3
/a/        98       88         86       86         76
/i/        88       94         94       92         84
/u/        90       90         78       66         64
Average    92       90.67      86       81.33      74.67

REFERENCES

[1] I. Maddieson, Patterns of Sounds, Cambridge: Cambridge University Press, 1984.

[2] J. R. Glass and V. W. Zue, “Detection of nasalized vowels in American English,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, 1985, pp. 1569–1572.

[3] M. H. Johnson et al., “Landmark-based speech recognition: Report of the 2004 Johns Hopkins summer workshop,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, 2005, pp. 213–216.

[4] M. Y. Chen, “Acoustic correlates of English and French nasalized vowels,” J. Acoust. Soc. Am., Vol. 102, No. 4, Oct. 1997, pp. 2360–2370.

[5] R. A. Krakow, “Non segmental influences on velum movement patterns: Syllables, Sentences, Stress and Speaking Rate,” In: M. K. Huffman, R. A. Krakow, Eds., Phonetics and Phonology: Nasals, Nasalization, and the Velum, Vol. 5, San Diego: Academic Press, 1993, pp. 87–116.

[6] G. Fant, Acoustic Theory of Speech Production, 2nd Edition, Netherlands: Mouton, 1960.

[7] S. Hawkins and K. N. Stevens, “Acoustic and perceptual correlates of the non-nasal-nasal distinction for vowels,” J. Acoust. Soc. Am., Vol. 77, No. 4, Apr. 1985, pp. 1560–1574.

[8] N. F. Chen, J. L. Slifka, and K. N. Stevens, “Vowel Nasalization in American English: Acoustic Variability Due To Phonetic Context,” Speech Communication, Aug. 2007, pp. 905–908.

[9] D. A. Cairns, J. H. L. Hansen, and J. E. Riski, “A noninvasive technique for detecting hypernasal speech using a nonlinear operator,” IEEE Trans. Biomed. Eng., Vol. 43, No. 1, Jan. 1996, pp. 35–45.

[10] J. Yuan, A. Seidl, and A. Cristia, “Automatic detection and comparison of vowel nasalization in American English,” J. Acoust. Soc. Am., Vol. 128, No. 4, Oct. 2010, p. 2291.

[11] T. Pruthi, “Analysis, Vocal-tract Modelling, and Automatic Detection of Vowel Nasalization,” Ph.D. thesis, University of Maryland, College Park, USA, 2007.

[12] J. Yuan and M. Liberman, “Automatic measurement and comparison of vowel nasalization across languages,” Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS XVII), Hong Kong, 17–21 Aug. 2011, pp. 2244–2247.

[13] D. O’Shaughnessy, Speech Communications: Human and Machine, 2nd Edition, NY: Universities Press, 2000.

[14] H. A. Murthy and B. Yegnanarayana, “Formant extraction from minimum phase group delay function,” Speech Communication, Vol. 10, Aug. 1991, pp. 209–221.

[15] H. A. Murthy and V. Gadde, “The modified group delay function and its application to phoneme recognition,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, Apr. 2003, pp. 68–71.

[16] P. Vijayalakshmi and M. R. Reddy, “Analysis of hypernasality by synthesis,” Proc. Int. Conf. Spoken Language Processing, Jeju Island, South Korea, Oct. 2004, pp. 525–528.

[17] TIMIT, “TIMIT acoustic-phonetic continuous speech corpus,” National Institute of Standards and Technology Speech Disc 1-1.1, NTIS Order No. PB91-5050651996, Oct. 1990.

[18] T. Pruthi and C. Y. Espy-Wilson, “Acoustic Parameters for the Automatic Detection of Vowel Nasalization,” Interspeech, Antwerp, Aug. 2007, pp. 1925–1928.