

Works on measuring sonority

We assume that sonority is related to the consistent temporal pattern in sub-band energies captured by the spectro-temporal correlation (STC) [2], which has been used to compute the Temporal Correlation and Selected Sub-Band Correlation (TCSSBC) [3] contour.

STC has been shown to be effective in exploiting formant-like structures in the spectral domain with the help of the short-time energy contours of 19 sub-bands.

However, TCSSBC introduces peaks in the less sonorous regions (as shown below) because it uses all 19 sub-bands.

We modify TCSSBC by selecting a few sub-bands to reduce its peaky nature in those regions, and call the result sonorous TCSSBC (S-TCSSBC).
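The effect of restricting the correlation to a few sub-bands can be illustrated with a minimal sketch. It is not the STC computation of the cited work: it only averages pairwise products of short-time energy contours over a chosen set of sub-bands. The function name `subband_correlation`, the toy energies and the selected indices are all assumptions made for illustration.

```python
from itertools import combinations


def subband_correlation(energies, selected):
    """Average pairwise product of sub-band energy contours, frame by frame.

    energies: list of per-sub-band short-time energy contours (equal lengths).
    selected: indices of the sub-bands to include (hypothetical choice).
    This is a simplified stand-in, not the exact STC/TCSSBC definition.
    """
    pairs = list(combinations(selected, 2))
    if not pairs:
        raise ValueError("need at least two selected sub-bands")
    n_frames = len(energies[0])
    contour = []
    for t in range(n_frames):
        total = sum(energies[i][t] * energies[j][t] for i, j in pairs)
        contour.append(total / len(pairs))
    return contour


# Toy data: 4 sub-bands, 3 frames. Sub-bands 0 and 1 are active in the
# first two frames; sub-bands 2 and 3 are active only in the last frame,
# imitating energy concentrated in a less sonorous region.
energies = [
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 3.0],
]
all_bands = subband_correlation(energies, [0, 1, 2, 3])
few_bands = subband_correlation(energies, [0, 1])
```

With all four toy sub-bands the contour peaks in the last frame, while the contour built from sub-bands 0 and 1 alone is zero there. Which sub-bands to keep is learnt from data in the proposed approach; the indices here are arbitrary.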

Proposed approach

Block diagram representing the proposed approach

Sonority based feature computation:

AUTOMATIC DETECTION OF SYLLABLE STRESS USING SONORITY BASED PROMINENCE FEATURES FOR PRONUNCIATION EVALUATION

Chiranjeevi Yarra (1), Om D. Deshmukh (2), Prasanta Kumar Ghosh (1)

(1) Electrical Engineering, Indian Institute of Science (IISc), Bangalore-560012, India; (2) Xerox Research Center India, Bangalore, India. chiranjeevi/[email protected], [email protected], [email protected]

Problem definition

Experimental set-up

We use unweighted accuracy (UA) and weighted accuracy (WA) as objective measures.

The work by Tepperman et al. [4] is the baseline method.

Experiments are conducted on the ISLE corpus, which contains 7834 sentences.

We perform the experiments under two setups: 1) five-fold cross validation and 2) the setup used in the baseline.

In the cross validation, we use three folds for training, one fold for feature selection and one fold for testing. We find the optimal sub-bands using one fold selected randomly from the training set; half of its data is used for SVM training and the remaining half for selecting the sub-bands.

We use STC parameters identical to those in the work by Wang et al. [2].

We use an SVM classifier with an RBF kernel for the classification task, with the complexity parameter (C) equal to 1.0 and gamma (γ) equal to 1/(number of features).

In the post-processing, we use the estimated labels and decision scores from the SVM classifier.
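One possible reading of the post-processing step is sketched below. It assumes that every word carries exactly one stressed syllable, so when the estimated labels of a word do not contain exactly one stressed syllable, the syllable with the highest decision score is marked as stressed. The function name `post_process` and this rule are assumptions for illustration; the exact rule is not spelled out here.

```python
def post_process(labels, scores):
    """Re-label the syllables of one word using SVM decision scores.

    labels: estimated labels per syllable (1 = stressed, 0 = unstressed).
    scores: decision scores per syllable (higher = more likely stressed).
    Assumption: each word has exactly one stressed syllable.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not labels:
        return []
    # Keep the estimate when it already has exactly one stressed syllable.
    if sum(labels) == 1:
        return list(labels)
    # Otherwise mark only the syllable with the largest decision score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]


# A word whose three syllables were all estimated as unstressed,
# with decision scores D1 > D2 > D3.
corrected = post_process([0, 0, 0], [-0.2, -0.5, -0.9])
```

In this toy case the first syllable has the largest score, so the labels 0 0 0 become 1 0 0.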

Results

Optimal sub-bands selected (black boxes) for the two levels of features.

ADFs are selected in both the syllable level and the syllable nuclei level features.

UAs estimated from the baseline as well as from the proposed approach, without and with post-processing (WoPP and WPP).

S-TCSSBC with optimal features performs better than S-TCSSBC with all features, and better than TCSSBC.
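The optimal sub-bands and features are obtained by forward selection. A generic greedy version is sketched below under stated assumptions: the `score` callable (for example, accuracy on held-out data), the stopping rule (stop when no candidate improves the score) and the toy scoring function are hypothetical and are not taken from the experiments.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    candidates: iterable of items, e.g. sub-band indices 1..19.
    score: callable mapping a list of selected items to a number to
           maximise; hypothetical, supplied by the caller.
    Returns the selected items (in order of selection) and their score.
    """
    remaining = list(candidates)
    selected = []
    best_score = float("-inf")
    while remaining:
        # Try adding each remaining candidate to the current selection.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_item = max(trials, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # assumed stopping rule: no candidate improves the score
        selected.append(top_item)
        remaining.remove(top_item)
        best_score = top_score
    return selected, best_score


# Toy score: sub-bands 2, 5 and 7 help, every other sub-band hurts.
useful = {2, 5, 7}


def toy_score(bands):
    chosen = set(bands)
    return len(chosen & useful) - 0.5 * len(chosen - useful)


chosen_bands, chosen_score = forward_select(range(1, 20), toy_score)
```

In practice `score` would train and evaluate a classifier on the data set aside for selection, which is far more expensive than the toy function used here.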

References

1. Alan Cruttenden, Gimson’s pronunciation of English, Routledge, 2014.

2. Supriya Nagesh, Chiranjeevi Yarra, Om D Deshmukh, and Prasanta Kumar Ghosh, “A robust speech rate estimation based on the activation profile from the selected acoustic unit dictionary,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5400–5404, 2016.

3. Dagen Wang and Shrikanth S Narayanan, “Robust speech rate estimation for spontaneous speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2190–2201, 2007.

4. Joseph Tepperman and Shrikanth Narayanan, “Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 937–940, 2005.

This work is supported by “Prime Minister Fellowship for Doctoral Research”, and we are thankful to “Xerox Research Center India” for the fellowship.

Conclusion & future work

A sonority based feature contour is proposed for the automatic syllable stress detection task, in place of the traditional short-time energy contour.

The contour is computed by combining sonority motivated cues with sub-band short-time energy contours reflecting prominence measures.

Experiments with the ISLE corpus show that the proposed method improves stress detection performance compared to the baseline scheme.

Future work includes using the proposed features for stress detection in native English speech as well as in non-native English speech from nativities other than German and Italian.

How is sonority useful?

Sonority refers to the carrying power of individual sounds in a word or a longer utterance. The carrying power is measured based on the sonority hierarchy of various classes of sounds [1].

The figure below shows that the short-time energy contours have larger variability than the proposed hypothetical sonority contour.

Combining sonority cues with short-time energy could discriminate stressed and unstressed syllables better.

[Block diagram: speech signal, STC, S-TCSSBC, split into syllable segments, sonority based features, SVM classifier, post processing. Additional inputs: syllable boundaries and word boundaries. Forward sub-band selection and forward feature selection are learnt during training. Post-processing example: estimated labels 0 0 0 (incorrect detection) with decision scores D1 > D2 > D3 become 1 0 0 (correct detection), where Di is the decision score, 1 denotes a stressed syllable and 0 an unstressed syllable.]

[Figure: short-time energy contour and sonority hierarchy contour, with phone boundaries.]

[Figure: unweighted accuracy (UA) and weighted accuracy (WA) for syllable nuclei level features, syllable level features and combined features, each without (WoPP) and with (WPP) post-processing, for TCSSBC, S-TCSSBC and S-TCSSBC#.]

        Baseline          S-TCSSBC#
        WoPP     WPP      WoPP     WPP
GER     85.57    85.81    84.29    87.53
ITA     82.57    83.17    83.73    86.26
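The two objective measures can be illustrated with a short sketch. It assumes the common convention in which UA is the mean of the per-class recalls and WA is the overall fraction of correctly labelled syllables; these definitions are assumptions, not stated explicitly here, and the labels below are made up.

```python
def unweighted_and_weighted_accuracy(y_true, y_pred):
    """Return (UA, WA) for two equally long label sequences.

    Assumed definitions: UA is the mean of per-class recalls,
    WA is the overall fraction of correct labels.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    ua = sum(recalls) / len(recalls)
    wa = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return ua, wa


# Made-up labels: 1 = stressed, 0 = unstressed.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
ua, wa = unweighted_and_weighted_accuracy(y_true, y_pred)
```

Because unstressed syllables outnumber stressed ones, WA is dominated by the unstressed class, while UA gives both classes equal weight.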

Sonority hierarchy, from most to least sonorous: open vowels, closed vowels, glides, liquids, nasals, fricatives, affricates, stops.

20-dim features for each syllable Si:

10-dim syllable level features using x (x is S-TCSSBC within Si): 5-dim SFs, 3-dim TFs, 2-dim ADFs

10-dim syllable nuclei level features using x1 (x1 is S-TCSSBC within the syllable nuclei of Si): 5-dim SFs, 3-dim TFs, 2-dim ADFs

Strength based features (SFs). Let z be equal to either x or x1, of length N.

1. Mean ($\hat{z} = \frac{1}{N}\sum_{n=1}^{N} z_n$)

2. Standard deviation ($\sqrt{\frac{1}{N}\sum_{n=1}^{N} (z_n - \hat{z})^2}$)

3. Geometric mean ($\left(\prod_{n=1}^{N} z_n\right)^{1/N}$)

4. Range (max(z) - min(z))

5. Median of z

Temporal variability based features (TFs). y is the resampled values of x or x1 to a fixed length.

[Equations for the three TFs, computed from y; the third is denoted κ.]

Area & duration based features (ADFs). Ai, di are the area & duration under x or x1 of Si.

1. $A_i / (A_1 + A_2 + A_3)$

2. $d_i / (d_1 + d_2 + d_3)$

[Figure: three syllables S1, S2, S3 with durations d1, d2, d3 and areas A1, A2, A3.]
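The strength based and the area & duration based features can be sketched as follows. The sketch assumes a population standard deviation (division by N) and assumes that area and duration are normalised by their sums over the syllables of the word; both choices, the function names and the toy values are assumptions for illustration. The temporal variability based features are left out.

```python
import statistics


def strength_features(z):
    """5-dim strength based features of a contour segment z.

    z: positive contour values within a syllable or its nucleus.
    Assumption: population standard deviation (division by N).
    """
    return [
        statistics.fmean(z),
        statistics.pstdev(z),
        statistics.geometric_mean(z),  # requires positive values
        max(z) - min(z),
        statistics.median(z),
    ]


def area_duration_features(areas, durations):
    """2-dim area & duration features for each syllable of one word.

    Assumption: each area and duration is divided by the corresponding
    sum over all syllables of the word.
    """
    total_area = sum(areas)
    total_duration = sum(durations)
    return [
        (a / total_area, d / total_duration)
        for a, d in zip(areas, durations)
    ]


# Toy contour segment and a toy three-syllable word.
sfs = strength_features([1.0, 2.0, 4.0])
adfs = area_duration_features([2.0, 1.0, 1.0], [0.2, 0.1, 0.1])
```

Normalising within the word makes the features of a syllable relative to its neighbours, which suits a task where one syllable of the word stands out from the others.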

[Figure: short-time energy contour, TCSSBC and S-TCSSBC with phone boundaries; vertical axis: sub-band index.]

[Figure: the word TELEVISION (t ɛ . l ə . v i . ʒ ə n) with its stressed and unstressed syllables and their sonority levels.]

Compared to unstressed syllables, stressed syllables are perceptually more prominent. Prominence cues:

1. Intensity

2. Duration

3. Pitch

[Diagram: features, then classifier.]

Features are proposed by incorporating relative sonority levels in the prominence measures for syllable stress detection.

[Figure: selected features among SFs, TFs and ADFs, for syllable nuclei level features and syllable level features.]

[Figure: selected sub-bands in each fold for a) syllable level features and b) syllable nuclei level features; horizontal axis: sub-band index (1 to 19), vertical axis: fold index (1 to 5).]