Report VUV for Shifted Mar2011

  • 8/6/2019 Report VUV for Shifted Mar2011

    1/28

    UTD-REP-01 Page 1

    Technical Report UTD-REP-01

    Evaluation of Voiced/Unvoiced Detection Algorithms

    for Frequency-Shifted Speech

    Jaewook Lee and Philipos Loizou

    March 2011

    I. Introduction

    Voiced/unvoiced classification is important for speech coding, recognition, and enhancement, and various methods have therefore been developed to make it robust. In this report, four features are used for voiced/unvoiced speech classification [3]: the autocorrelation coefficient (AC), the pre-emphasized energy ratio (ER), the zero crossing rate (ZCR), and the low-to-full subband energy ratio (SR). Otsu's method is used to select a threshold level from the histogram of each of AC, ER, ZCR, and SR. For the final decision, the short-time energy (STE) with a fixed threshold level is used for silence detection [5,6]. A semiautomatic tool for voiced/unvoiced detection was developed to obtain reliable reference labels for the tests. Ten IEEE corpus sentences were selected, and their frequency content was shifted by 600 to 1500 Hz (upshift) and by -600 to -1500 Hz (downshift) for testing [7].

    II. Algorithms for Voiced/Unvoiced Detection

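    A. Equations for the Four Voiced/Unvoiced Detection Features

    All sums run over the samples n of one analysis frame of the speech signal s(n).

    Autocorrelation Coefficient (AC):

    $$\mathrm{AC} = \frac{\sum_{n} s(n)\, s(n-1)}{\sum_{n} s^{2}(n)} \tag{1}$$

    Pre-Emphasized Energy Ratio (ER):

    $$\mathrm{ER} = \frac{\sum_{n} \lvert s(n) - s(n-1) \rvert}{\sum_{n} \lvert s(n) \rvert} \tag{2}$$

    Zero Crossing Rate (ZCR):

    $$\mathrm{ZCR} = \sum_{n} I\{ s(n)\, s(n-1) < 0 \} \tag{3}$$

    where $I\{A\}$ is the indicator function, which is 1 if the argument A is true and 0 otherwise.

    Low-to-Full Subband Energy Ratio (SR):

    $$\mathrm{SR} = \frac{\sum_{n} s_{L}^{2}(n)}{\sum_{n} s^{2}(n)} \tag{4}$$

    where $s_{L}(n)$ is the speech signal low-pass filtered at 3 kHz.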
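As a rough illustration of how the four features behave, the following MATLAB sketch evaluates them on two synthetic 20 ms frames. The test signals (a 150 Hz tone standing in for voiced speech, white noise standing in for unvoiced speech), the variable names, and the FFT-based ideal low-pass used for SR are assumptions made for this example; the report's own sr function is not listed in the source, so this is not necessarily how SR was computed there.

```matlab
fs = 25000;                    % sampling rate listed in Table 1
N  = round(0.02*fs);           % 20 ms frame (500 samples), as in the report's code
t  = (0:N)'/fs;                % N+1 samples so that s(n-1) exists for every n
rng(0);                        % fixed seed so the noise frame is repeatable
frames = {sin(2*pi*150*t), randn(N+1,1)};   % assumed voiced-like and unvoiced-like frames
names  = {'voiced-like', 'unvoiced-like'};
f   = (0:N-1)'*fs/N;           % FFT bin frequencies in Hz
low = (f <= 3000) | (f >= fs-3000);          % bins within 3 kHz of DC (both spectrum halves)
feat = zeros(2,4);             % rows: frames, columns: AC, ER, ZCR, SR
for k = 1:2
    s  = frames{k}(2:end);     % s(n)
    sp = frames{k}(1:end-1);   % s(n-1)
    S  = abs(fft(s)).^2;       % power spectrum of the frame
    feat(k,1) = sum(s.*sp)/sum(s.^2);        % AC, Eq. (1)
    feat(k,2) = sum(abs(s-sp))/sum(abs(s));  % ER, Eq. (2)
    feat(k,3) = sum(s.*sp < 0);              % ZCR, Eq. (3): count of sign changes
    feat(k,4) = sum(S(low))/sum(S);          % SR, Eq. (4), via an ideal 3 kHz low-pass
    fprintf('%-14s AC=%6.3f ER=%6.3f ZCR=%4d SR=%6.3f\n', names{k}, feat(k,:));
end
```

The tone should give AC and SR close to 1 with small ER and ZCR, while the noise frame should give AC near 0 and much larger ER and ZCR, which is the contrast the thresholds in the next subsection rely on.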

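    B. Equations for Automatic Threshold Level Selection

    Otsu's method (OTSU):

    The optimum global threshold level $k^{*}$ is the value of k for which the between-class variance $\sigma_B^{2}(k)$ is maximum:

    $$\sigma_B^{2}(k^{*}) = \max_{1 \le k \le N} \sigma_B^{2}(k) \tag{5-1}$$

    where the between-class variance, for k = 1, 2, ..., N, is

    $$\sigma_B^{2}(k) = \frac{\left( m_G P_1(k) - m(k) \right)^{2}}{P_1(k) \left( 1 - P_1(k) \right)} \tag{5-2}$$

    the global intensity mean is

    $$m_G = \sum_{i=1}^{N} i\, p(i) \tag{5-3}$$

    the cumulative means, for k = 1, 2, ..., N, are

    $$m(k) = \sum_{i=1}^{k} i\, p(i) \tag{5-4}$$

    and the cumulative sums, for k = 1, 2, ..., N, are

    $$P_1(k) = \sum_{i=1}^{k} p(i) \tag{5-5}$$

    Here p(i), for i = 1, 2, ..., N, is the normalized histogram of the input signal.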
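The threshold search of Eqs. (5-1) to (5-5) can be sketched in a few lines of base MATLAB. The bimodal toy data, the choice of 64 histogram bins, and all variable names are assumptions for illustration; the report's own code calls graythresh for this step, and its binning may differ.

```matlab
rng(1);                          % fixed seed for a repeatable toy example
x = [0.25 + 0.08*randn(300,1); 0.75 + 0.08*randn(300,1)];  % assumed bimodal feature values
x = min(max(x, 0), 1);           % clip to [0, 1], like a normalized feature
N = 64;                          % number of histogram bins (assumption)
idx = min(floor(x*N) + 1, N);    % bin index 1..N for each value
p   = accumarray(idx, 1, [N 1]) / numel(x);   % normalized histogram p(i)
i   = (1:N)';
P1  = cumsum(p);                 % cumulative sums, Eq. (5-5)
m   = cumsum(i.*p);              % cumulative means, Eq. (5-4)
mG  = m(end);                    % global mean, Eq. (5-3)
sigB = (mG*P1 - m).^2 ./ (P1.*(1 - P1));      % between-class variance, Eq. (5-2)
sigB(~isfinite(sigB)) = 0;       % bins where one class is empty give 0/0
[~, kStar] = max(sigB);          % optimum bin index, Eq. (5-1)
thr = kStar / N;                 % upper edge of bin k*, back on the [0, 1] scale
fprintf('Otsu threshold: %.3f\n', thr);
```

For two well-separated clusters such as these, the selected threshold falls in the gap between them, which is the behaviour the histograms in Figure 1 depict for the real features.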

    Histogram for Normalized AC and Threshold Level Using Otsu's Method:

    Figure 1. Histogram for (a) normalized AC, (b) normalized ER, (c) normalized ZCR and (d) normalized SR, and their threshold levels selected by Otsu's method.

    [Figure 1 panels: x-axis normalized feature value (0 to 1), y-axis count (0 to 30); marked threshold levels: (a) 0.69, (b) 0.46, (c) 0.47, (d) 0.6.]

    Voiced/Unvoiced Detection Using 4 Methods with Automatically Selected Threshold Level:

    Figure 2. (a) Normalized AC with threshold level (0.68). (b) Normalized ER with threshold level (0.46). (c) Normalized ZCR with threshold level (0.47). (d) Normalized SR with threshold level (0.60). (e) Voiced/unvoiced detection using AC and its threshold level. (f) Voiced/unvoiced detection from AC.

    [Figure 2 panels as labeled in the plot: (a) data, (b) normalized AC, (c) normalized ER, (d) normalized ZCR, (e) normalized SR, (f) VUV from AC; x-axis: time (sec).]

    C. Decision Making

    Short-Time Energy (STE):

    $$\mathrm{STE} = \sum_{n} s^{2}(n) \tag{6}$$

    where the sum runs over the samples n of one analysis frame.

    STE with Fixed Threshold Level for Silence Detection (0.08 for unshifted and upshifted speech):

    Figure 3. (a) Waveform of sample sentence. (b) Short-time energy with fixed threshold level (0.08). (c) Silence detection using STE with its threshold level.

    [Figure 3 panels: (a) data, (b) STE, (c) silence from STE; x-axis: time (sec).]

    Final Decision for Voiced/Unvoiced Speech Detection:

    Figure 4. (a) Waveform of sample sentence. (b) Voiced/unvoiced detection using AC. (c) Silence detection using STE. (d) Final decision for voiced/unvoiced detection of sample sentence.

    [Figure 4 panels: (a) data, (b) VUV from AC, (c) silence from STE, (d) final decision for VUV; x-axis: time (sec).]
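A minimal sketch of how a feature-based decision and the STE silence decision might be combined frame by frame is given below. The per-frame values are toy numbers, and the rule itself (a frame is voiced only if its normalized AC exceeds the AC threshold and its STE is not below the silence threshold) is an assumption consistent with Figure 4, not a rule stated explicitly in the report.

```matlab
ac  = [0.20 0.85 0.90 0.95 0.30 0.88 0.10];  % normalized AC per frame (toy values)
ste = [0.01 0.50 1.20 0.90 0.40 0.05 0.02];  % short-time energy per frame (toy values)
thresAC  = 0.68;                 % AC threshold quoted in the Figure 2 caption
thresSil = 0.08;                 % fixed STE threshold for silence
vuvAC  = ac > thresAC;           % 1 = voiced according to AC alone (assumed direction)
speech = ste >= thresSil;        % 0 = silence according to STE
vuv    = vuvAC & speech;         % final decision: voiced only if both agree
disp(double([vuvAC; speech; vuv]));   % rows: AC decision, non-silence, final decision
```

Frame 6 shows why the silence gate matters: its AC is high, but its energy is below the silence threshold, so the final decision marks it as not voiced.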

    III. Materials and Experimental Methods

    A. 10 IEEE Sample Sentences and Reference Voiced/Unvoiced Detection

    Table 1. The 10 IEEE sample sentences used for testing.

    No.  Sentence                                        Sex  Length (s)  Fs (kHz)
    1    The birch canoe slid on the smooth planks.      M    2.8         25
    2    He knew the skill of the great young actress.   M    3.5         25
    3    Her purse was full of useless trash.            M    2.2         25
    4    Read verse out loud for pleasure.               M    2.1         25
    5    Wipe the grease off his dirty face.             M    2.2         25
    6    He wrote down a long list of items.             F    2.9         25
    7    The drip of the rain made a pleasant sound.     F    2.7         25
    8    Smoke poured out of every crack.                F    2.5         25
    9    Hats are worn to tea and not to dinner.         F    2.9         25
    10   The clothes dried on a thin wooden rack.        F    2.9         25

    10 IEEE Sample Sentences and Voiced/Unvoiced Detection as a Reference:

    Figure 5. Waveforms of the 10 IEEE sample sentences and their reference voiced/unvoiced detection, obtained manually using the spectrogram.

    [Figure 5 panels: sentences 1 to 10, one panel each; x-axis: sample index (x 10^4).]

    B. Upshifted and Downshifted Speech

    Spectrogram for Unshifted and Shifted Speech:

    Figure 6. Spectrogram of sample sentence: (a) unshifted speech, (b) upshifted speech by 800 Hz, (c) upshifted speech by 1200 Hz, (d) downshifted speech by -800 Hz and (e) downshifted speech by -1200 Hz.

    [Figure 6 panels: (a) to (e); x-axis: time (sec), 0 to 2; y-axis: freq (Hz), 0 to 10000.]

    C. Tool for Reference Voiced/Unvoiced Detection

    Initial voiced/unvoiced detection using ER:

    Figure 7. Voiced/unvoiced detection using ER, shown with the spectrogram.

    Manual correction of the voiced/unvoiced classification, performed by clicking the mouse on wrongly detected points while viewing the spectrogram:

    Figure 8. Voiced/unvoiced detection after manual correction using the spectrogram.

    D. Performance Measurement

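    Table 2. Definition of Symbols in Error Calculation

    HIT0:   an unvoiced segment is correctly detected as unvoiced (unvoiced -> unvoiced)
    FALSE0: an unvoiced segment is detected as voiced (unvoiced -> voiced)
    HIT1:   a voiced segment is correctly detected as voiced (voiced -> voiced)
    FALSE1: a voiced segment is detected as unvoiced (voiced -> unvoiced)

    The three rates below are written out as they are computed in the Matlab code at the end of this report.

    Hit Rate:

    $$\text{Hit Rate} = \frac{\mathrm{HIT0}}{\mathrm{HIT0} + \mathrm{FALSE0}} \times 100 \tag{7-1}$$

    False Alarm Rate:

    $$\text{False Alarm Rate} = \frac{\mathrm{FALSE1}}{\mathrm{HIT1} + \mathrm{FALSE1}} \times 100 \tag{7-2}$$

    Error Rate:

    $$\text{Error Rate} = \frac{\mathrm{FALSE0} + \mathrm{FALSE1}}{\mathrm{HIT0} + \mathrm{FALSE0} + \mathrm{HIT1} + \mathrm{FALSE1}} \times 100 \tag{7-3}$$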
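The counts of Table 2 and the three rates can be illustrated on a small example. The two label vectors are toy data; the rate expressions follow the report's evaluation code, in which the hit rate is taken over unvoiced reference frames, the false alarm rate over voiced reference frames, and the error rate over all frames.

```matlab
ref = [0 0 0 0 1 1 1 1 1 1];     % reference labels per frame (1 = voiced), toy data
det = [0 0 0 1 1 1 1 1 0 1];     % detector output for the same frames, toy data
hit0   = sum(ref==0 & det==0);   % unvoiced -> unvoiced
false0 = sum(ref==0 & det==1);   % unvoiced -> voiced
hit1   = sum(ref==1 & det==1);   % voiced -> voiced
false1 = sum(ref==1 & det==0);   % voiced -> unvoiced
hitRate   = 100*hit0/(hit0+false0);           % Eq. (7-1)
falseRate = 100*false1/(hit1+false1);         % Eq. (7-2)
errRate   = 100*(false0+false1)/numel(ref);   % Eq. (7-3)
fprintf('hit %.2f%%, false alarm %.2f%%, error %.2f%%\n', hitRate, falseRate, errRate);
```

With these labels the counts are HIT0 = 3, FALSE0 = 1, HIT1 = 5 and FALSE1 = 1, giving a hit rate of 75%, a false alarm rate of about 16.67% and an error rate of 20%.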

    IV. Experimental Results

    A. Voiced/Unvoiced Detection for Unshifted Speech

    Voiced/Unvoiced Detection for Unshifted Male Sentence Using 4 Methods:

    Figure 9. (a) Reference voiced/unvoiced detection for male sentence (Her purse was full of useless trash.). Voiced/unvoiced detection: (b) AC, (c) ER, (d) ZCR and (e) SR.

    [Figure 9 panels: (a) Ref., (b) AC, (c) ER, (d) ZCR, (e) SR; x-axis: time (sec), 0 to 2.]

    Voiced/Unvoiced Detection for Unshifted Female Sentence Using 4 Methods:

    Figure 10. (a) Reference voiced/unvoiced detection for female sentence (Hats are worn to tea and not to dinner.). Voiced/unvoiced detection: (b) AC, (c) ER, (d) ZCR and (e) SR.

    [Figure 10 panels: (a) Ref., (b) AC, (c) ER, (d) ZCR, (e) SR; x-axis: time (sec), 0 to 2.5.]

    Comparison of Detection Result with Reference Detection:

    Table 3. Hit, False Alarm and Error Rate of Voiced/Unvoiced Detection for Unshifted Speech

    Method  Hit Rate (%)  False Alarm Rate (%)  Error Rate (%)
    AC      92.8222       1.7833                4.2474
    ER      93.1485       1.7833                4.0984
    ZCR     93.3116       1.7833                4.0238
    SR      92.8222       1.7833                4.2474

    Error Rate of Each Method:

    Figure 11. Hit, false alarm and error rate for unshifted speech.

    B. Voiced/Unvoiced Detection for Frequency Upshifted Sentences

    Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using AC:

    Figure 12. Voiced/unvoiced detection for upshifted speech using AC in the frequency range from 600 to 1500 Hz.

    [Figure 12 panels: Ref., then shifts of 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400 and 1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using ER:

    Figure 13. Voiced/unvoiced detection for upshifted speech using ER in the frequency range from 600 to 1500 Hz.

    [Figure 13 panels: Ref., then shifts of 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400 and 1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using ZCR:

    Figure 14. Voiced/unvoiced detection for upshifted speech using ZCR in the frequency range from 600 to 1500 Hz.

    [Figure 14 panels: Ref., then shifts of 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400 and 1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Upshifted Sentences Using SR:

    Figure 15. Voiced/unvoiced detection for upshifted speech using SR in the frequency range from 600 to 1500 Hz.

    [Figure 15 panels: Ref., then shifts of 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400 and 1500 Hz; x-axis: time (sec).]

    Comparison of Detection Result with Reference Detection:

    Table 4. Hit, False Alarm and Error Rate (%) for Upshifted Speech.

    Shift  |        AC           |        ER           |        ZCR          |        SR
    (Hz)   | Hit    False  Error | Hit    False  Error | Hit    False  Error | Hit    False  Error
    600    | 93.31  1.92   4.09  | 93.47  1.92   4.02  | 93.96  1.92   3.80  | 93.47  1.92   4.02
    700    | 93.14  1.92   4.17  | 93.31  1.92   4.09  | 93.47  1.92   4.02  | 93.31  1.92   4.09
    800    | 93.14  1.92   4.17  | 93.31  1.92   4.09  | 93.47  1.92   4.02  | 93.31  1.92   4.09
    900    | 93.31  1.92   4.09  | 93.47  1.92   4.02  | 93.96  1.92   3.80  | 93.47  1.92   4.02
    1000   | 93.14  1.92   4.17  | 93.31  1.92   4.09  | 93.63  1.92   3.94  | 93.31  1.92   4.09
    1100   | 93.14  1.92   4.17  | 93.31  1.92   4.09  | 93.80  1.92   3.87  | 93.31  1.92   4.09
    1200   | 93.47  1.92   4.02  | 93.47  1.92   4.02  | 93.96  1.92   3.80  | 93.47  1.92   4.02
    1300   | 93.31  1.92   4.09  | 93.31  1.92   4.09  | 93.80  1.92   3.87  | 93.31  1.92   4.09
    1400   | 93.47  1.92   4.02  | 93.47  1.92   4.02  | 93.63  1.92   3.94  | 93.47  1.92   4.02
    1500   | 93.31  1.92   4.09  | 93.31  1.92   4.09  | 93.63  1.92   3.94  | 93.31  1.92   4.09

    Error Rate for Each Upshifted Frequency Level:

    Figure 16. Hit, false alarm and error rate for upshifted speech.

    [Figure 16 panels: hit rate (%), false alarm rate (%) and error rate (%) for AC, ER, ZCR and SR; x-axis: frequency level (Hz), 600 to 1500.]

    C. Voiced/Unvoiced Detection for Frequency Downshifted Sentences

    Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using AC:

    Figure 17. Voiced/unvoiced detection for downshifted speech using AC in the frequency range from -600 to -1500 Hz.

    [Figure 17 panels: Ref., then shifts of -600, -700, -800, -900, -1000, -1100, -1200, -1300, -1400 and -1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using ER:

    Figure 18. Voiced/unvoiced detection for downshifted speech using ER in the frequency range from -600 to -1500 Hz.

    [Figure 18 panels: Ref., then shifts of -600, -700, -800, -900, -1000, -1100, -1200, -1300, -1400 and -1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using ZCR:

    Figure 19. Voiced/unvoiced detection for downshifted speech using ZCR in the frequency range from -600 to -1500 Hz.

    [Figure 19 panels: Ref., then shifts of -600, -700, -800, -900, -1000, -1100, -1200, -1300, -1400 and -1500 Hz; x-axis: time (sec).]

    Voiced/Unvoiced Detection for Frequency Downshifted Sentences Using SR:

    Figure 20. Voiced/unvoiced detection for downshifted speech using SR in the frequency range from -600 to -1500 Hz.

    [Figure 20 panels: Ref., then shifts of -600, -700, -800, -900, -1000, -1100, -1200, -1300, -1400 and -1500 Hz; x-axis: time (sec).]

    Comparison of Detection Result with Reference Detection:

    Table 5. Hit, False Alarm and Error Rate (%) for Downshifted Speech.

    Shift  |        AC           |        ER           |        ZCR          |        SR
    (Hz)   | Hit    False  Error | Hit    False  Error | Hit    False  Error | Hit    False  Error
    -600   | 73.40  3.97   14.30 | 76.34  3.97   12.96 | 76.50  4.52   13.18 | 75.20  4.25   13.63
    -700   | 77.16  5.62   13.48 | 80.26  5.76   12.14 | 79.44  5.76   12.51 | 78.79  5.62   12.74
    -800   | 79.77  6.03   12.51 | 83.36  6.58   11.17 | 83.52  6.85   11.25 | 82.38  6.85   11.77
    -900   | 79.77  7.13   13.11 | 83.03  7.27   11.69 | 83.03  7.95   12.07 | 81.72  7.13   12.22
    -1000  | 78.62  8.36   14.30 | 82.05  9.05   13.11 | 82.21  8.64   12.81 | 80.42  8.77   13.71
    -1100  | 78.95  10.19  15.12 | 82.87  10.69  13.63 | 82.70  10.97  13.85 | 80.58  10.56  14.60
    -1200  | 78.79  10.21  15.27 | 83.52  11.11  13.56 | 81.72  10.56  14.08 | 81.23  10.83  14.45
    -1300  | 74.38  11.24  17.80 | 78.95  12.34  16.31 | 78.46  12.20  16.46 | 76.83  11.93  17.06
    -1400  | 79.77  12.75  16.16 | 84.01  14.67  15.27 | 83.68  14.95  15.57 | 82.05  13.58  15.57
    -1500  | 79.77  13.71  16.69 | 83.36  15.22  15.87 | 83.84  16.46  16.31 | 81.89  14.54  16.16

    Error Rate on Each Downshifted Frequency Level:

    Figure 21. Hit, false alarm and error rate for downshifted speech.
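The spread of error rates for downshifted speech can be checked directly from the Error columns of Table 5. The sketch below copies those columns into a matrix and reports the smallest and largest entries; the variable names are illustrative.

```matlab
shifts  = -600:-100:-1500;                 % rows of Table 5, in Hz
methods = {'AC', 'ER', 'ZCR', 'SR'};       % column groups of Table 5
E = [14.30 12.96 13.18 13.63               % Error (%) columns copied from Table 5
     13.48 12.14 12.51 12.74
     12.51 11.17 11.25 11.77
     13.11 11.69 12.07 12.22
     14.30 13.11 12.81 13.71
     15.12 13.63 13.85 14.60
     15.27 13.56 14.08 14.45
     17.80 16.31 16.46 17.06
     16.16 15.27 15.57 15.57
     16.69 15.87 16.31 16.16];
[lo, iLo] = min(E(:));  [rLo, cLo] = ind2sub(size(E), iLo);
[hi, iHi] = max(E(:));  [rHi, cHi] = ind2sub(size(E), iHi);
fprintf('lowest error  %.2f%% (%s, %d Hz)\n', lo, methods{cLo}, shifts(rLo));
fprintf('highest error %.2f%% (%s, %d Hz)\n', hi, methods{cHi}, shifts(rHi));
[~, bestRow] = min(E);                     % best shift (row index) for each method
```

The lowest entry is 11.17% (ER at -800 Hz) and the highest is 17.80% (AC at -1300 Hz), and every method has its lowest error at the -800 Hz shift.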

    V. Conclusions and Planned Activity

    Four features, the autocorrelation coefficient (AC), the pre-emphasized energy ratio (ER), the zero crossing rate (ZCR), and the low-to-full subband energy ratio (SR), were used for voiced/unvoiced classification, with their threshold levels selected automatically by Otsu's method. The short-time energy with a fixed threshold level was used for silence detection to make the final voiced/unvoiced decision. Ten IEEE corpus sentences were used for testing, and their reference voiced/unvoiced labels were obtained manually using the spectrogram. For unshifted speech and for speech upshifted by 600 Hz to 1500 Hz, all four methods have an error rate under 4.3%. For speech downshifted by 600 Hz to 1500 Hz, all four methods have error rates of 11% to 18%. Three activities to improve the performance of voiced/unvoiced detection are planned:

    Multiple Threshold Levels:

    Two or more threshold levels will be determined. The lower level is used for voiced/unvoiced detection, and the upper level is used to confirm whether the frame is voiced or unvoiced. Each level will be obtained by applying Otsu's method multiple times [5,6].

    Reliable Detection for Weak Voiced Speech:

    A segment detected as unvoiced next to a voiced utterance is likely to be weak voiced speech in the transition between voiced and unvoiced utterances. STE will be used to detect voiced speech, and AC, ER, ZCR, and SR will be used to detect unvoiced speech. The two results will then be combined to detect weak voiced speech [4].

    New Approach to Decision Making:

    Voiced, unvoiced, and silence segments will be labeled manually, and statistics such as the mean and variance of each class will be obtained from their histograms. The final decision will be made using these statistics. This is a Bayesian approach to voiced/unvoiced detection [4].
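As a rough sketch of what the statistical decision described in the last item could look like, the example below fits a mean and standard deviation to one feature for each of the three classes and assigns a new frame to the class with the highest Gaussian likelihood. The training values, the single-feature Gaussian model, and the equal class priors are all assumptions made for illustration; the report does not specify the model that will be used.

```matlab
% Toy feature values (for example normalized AC) of manually labeled frames
xVoiced   = [0.82 0.90 0.88 0.95 0.79];
xUnvoiced = [0.35 0.42 0.28 0.50 0.31];
xSilence  = [0.05 0.10 0.02 0.08 0.12];
classes = {xVoiced, xUnvoiced, xSilence};
names   = {'voiced', 'unvoiced', 'silence'};
mu = cellfun(@mean, classes);    % class means
sg = cellfun(@std,  classes);    % class standard deviations
% Gaussian density evaluated for every class at once
gauss = @(x, m, s) exp(-(x-m).^2 ./ (2*s.^2)) ./ (s*sqrt(2*pi));
xNew = 0.45;                     % feature value of a new frame (toy value)
like = gauss(xNew, mu, sg);      % class likelihoods; equal priors assumed
[~, c] = max(like);              % maximum-likelihood class
fprintf('x = %.2f -> %s\n', xNew, names{c});
```

Here the new value lies closest, in units of class standard deviation, to the unvoiced cluster, so the frame is labeled unvoiced. Unequal priors or several features would change the decision rule but not its structure.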

    References

    [1] J. G. Proakis, Digital Signal Processing, 4th ed., Pearson.

    [2] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC.

    [3] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, Wiley.

    [4] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Prentice Hall.

    [5] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Pearson.

    [6] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.

    [7] IEEE Subcommittee, "IEEE Recommended Practice for Speech Quality Measurements," IEEE Transactions on Audio and Electroacoustics, vol. AU-17, no. 3, pp. 225-246, 1969.

    Matlab Code

    Matlab Function for Autocorrelation Coefficient (AC) with Short-time Energy (STE):

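    function [ac,ste,n]=ac(data,win_size)
    % Frame-wise first autocorrelation coefficient (ac) and short-time
    % energy (ste) over n non-overlapping frames of win_size samples.
    n=floor((length(data)-1)/win_size);    % keep one extra sample for the shifted copy
    data_fit=data(1:n*win_size);
    data_2=data(1:n*win_size+1);
    ac=zeros(1,n); ste=zeros(1,n);         % preallocate outputs
    for i=1:n
        data_win=data_fit(1+win_size*(i-1):win_size*i);     % current frame
        data_post=data_2(2+win_size*(i-1):1+win_size*i);    % frame delayed by one sample
        ac(i)=sum(data_post.*data_win)/sum(data_post.^2);
        ste(i)=sum(data_win.^2);
    end
    end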

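    Matlab Function for Pre-emphasized Energy Ratio (ER):

    function [er,ste,n]=er(data,win_size)
    % Frame-wise pre-emphasized energy ratio (er) and short-time
    % energy (ste) over n non-overlapping frames of win_size samples.
    n=floor((length(data)-1)/win_size);    % keep one extra sample for the shifted copy
    data_fit=data(1:n*win_size);
    data_2=data(1:n*win_size+1);
    er=zeros(1,n); ste=zeros(1,n);         % preallocate outputs
    for i=1:n
        data_win=data_fit(1+win_size*(i-1):win_size*i);     % current frame
        data_post=data_2(2+win_size*(i-1):1+win_size*i);    % frame delayed by one sample
        er(i)=sum(abs(data_post-data_win))/sum(abs(data_post));
        ste(i)=sum(data_win.^2);
    end
    end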

    Matlab Function for Zero Crossing Rate (ZCR):

    function [zcr,ste,n]=zcr(data,win_size)
    n=floor((length(data)-1)/win_size);    % keep one extra sample for the shifted copy
    data_fit=data(1:n*win_size);
    data_2=data(1:n*win_size+1);
    for i=1:n
        data_win=data_fit(1+win_size*(i-1):win_size*i);     % current frame
        data_post=data_2(2+win_size*(i-1):1+win_size*i);    % frame delayed by one sample
        [row,column]=find((data_win.*data_post)

    Semiautomatic Tool for Voiced/Unvoiced Detection using Spectrogram:

    filename='C:\Users\Owner\Desktop\P4\10sentences\sp03';
    cd('C:\Users\Owner\Desktop\P4\10sentences');
    [num,txt]=xlsread('10sentences.xlsx'); sentence=txt{3};
    [data,fs]=wavread(filename); win=0.02*fs;      % 20 ms frames
    [er,ste,n]=er(data,win); er=er/max(er);        % normalized ER per frame
    thres_sil=0.08;                                % fixed STE threshold for silence
    thres_er=graythresh(er);                       % Otsu threshold for ER
    vuv_er=ones(1,n);
    for i=1:n
        % condition reconstructed from a garbled source line (assumption):
        % silence or ER above its threshold marks the frame as not voiced
        if ste(i)<thres_sil || er(i)>thres_er vuv_er(i)=0; end
    end
    for i=1:n vuv(1+win*(i-1):win*i)=vuv_er(i); end   % expand frame labels to samples
    %%
    subplot(2,1,1); area(vuv,'edgecolor','c','facecolor','c'); hold on;
    subplot(2,1,1); plot(data(1:win*n)+0.4); ylim([0,0.9]); hold off;
    title(sentence,'fontsize',12);
    subplot(2,1,2); specgram(data(1:win*n));
    for j=1:100
        [x,y]=ginput(1);                           % mouse click on the plot
        for i=0:n
            if (x>=i*win+1 && x

    'C:\Users\Owner\Desktop\P4\10sentences\1500'};
    cd('C:\Users\Owner\Desktop\P4\10sentences');
    file=dir('*.wav'); file_ref=dir('*.mat');
    filenames={file.name}'; filenames_ref={file_ref.name}';
    %%
    for k=1:10                          % shift conditions, one folder each
        cd(pathnames{k})
        for i=1:10                      % sentences
            clear ac er zcr sr          % the results below shadow the functions of the same name
            [data,fs]=wavread(filenames{i});
            win=0.02*fs;                % 20 ms frames
            [ac,ste,n(i)]=ac(data,win);
            [er]=er(data,win);
            [zcr]=zcr(data,win);
            [sr]=sr(data,win,fs);
            ac=ac/max(ac);              % normalize each feature to a maximum of 1
            er=er/max(er);
            zcr=zcr/max(zcr);
            sr=sr/max(sr);
            thres_sil=0.08;             % fixed STE threshold for silence
            thres_ac=graythresh(ac);    % Otsu thresholds for the four features
            thres_er=graythresh(er);
            thres_zcr=graythresh(zcr);
            thres_sr=graythresh(sr);
            for j=1:n(i)
                if (ste(j)

    for l=1:4                           % 1: AC, 2: ER, 3: ZCR, 4: SR
        if l==1 vuv_1=vuv_ac; end
        if l==2 vuv_1=vuv_er; end
        if l==3 vuv_1=vuv_zcr; end
        if l==4 vuv_1=vuv_sr; end
        for k=1:10                      % shift conditions
            hit0=0; false0=0; hit1=0; false1=0;
            for i=1:10                  % sentences
                for j=1:n(i)            % frames
                    if (vuv(i,j)==0 && vuv_1(i,j,k)==0) hit0=hit0+1; end
                    if (vuv(i,j)==0 && vuv_1(i,j,k)==1) false0=false0+1; end
                    if (vuv(i,j)==1 && vuv_1(i,j,k)==1) hit1=hit1+1; end
                    if (vuv(i,j)==1 && vuv_1(i,j,k)==0) false1=false1+1; end
                end
            end
            % result names chosen so that the built-ins false and error are not shadowed
            hit_rate(l,k)=hit0/(hit0+false0)*100;
            false_rate(l,k)=false1/(hit1+false1)*100;
            error_rate(l,k)=(false0+false1)/sum(n)*100;
        end
    end