
ICSCN 2008 - International Conference on Signal Processing, Communications and Networking

Automated Detection of Transition Segments for Intensity and Time-Scale Modification for Speech Intelligibility Enhancement

by A. R. Jayan, P. C. Pandey, P. K. Lehana

EE Dept, IIT Bombay (arjayani@ee.iitb.ac.in)

5th January, 2008


PAPER OUTLINE

1. Introduction

2. Acoustic Properties of Clear Speech

3. Automated Detection of Transition Segments

4. Intensity and Time-Scale Modification

5. Experimental Results

6. Summary and Conclusion


INTRODUCTION

Speech landmarks
• Regions in speech containing important information for speech perception
• Associated with spectral transitions
• Most of the landmarks coincide with phoneme boundaries

Landmark types
1. Abrupt-consonantal (AC) - Tight constrictions of primary articulators
2. Abrupt (A) - Fast glottal or velum activity
3. Non-abrupt (N) - Semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V) - Vowel landmarks, oral cavity maximally open, maximum energy, F1

Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)


Objective: To improve speech intelligibility in quiet and noisy environments
• Automated detection of landmarks
• Speech modification using acoustic properties of clear speech


ACOUSTIC PROPERTIES OF CLEAR SPEECH

Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments

Examples (conversational and clear versions) - http://www.acoustics.org/press/145th/clr-spch-tab.htm
• ‘the book tells a story’
• ‘the boy forgot his book’

Intelligibility of clear speech
▪ More intelligible for different classes of listeners & listening conditions
▪ Picheny et al. (1985): ~17% more intelligible than conversational speech


Acoustic properties of clear speech (Picheny et al., 1986)

Sentence level
• Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
• Larger variation in fundamental frequency
• Increased number of pauses, longer pause durations

Word level
• Fewer sound deletions
• More sound insertions

Phonetic level
• Context-dependent, non-linear increase in segment durations
• More targeted vowel formants
• Increase in consonant intensity


Acoustic cues in clear speech are more robust and discriminable

Speech intelligibility of conversational speech can be improved by incorporating properties of clear speech

Consonant-vowel intensity ratio (CVR) enhancement
• Increasing the ratio of rms energy of the consonant segment to that of the nearby vowel

Consonant duration enhancement
• Increasing VOT, burst duration, formant transition duration

Difficulties
• Detection of regions for modification
• Performing modification with low signal processing artifacts
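The CVR enhancement described above can be illustrated with a minimal sketch. It assumes the consonant and vowel regions are already known as sample ranges; the function names, the toy signal, and the abrupt gain step at the segment edge are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rms(x):
    """Root-mean-square value of a signal segment."""
    return float(np.sqrt(np.mean(np.square(x))))

def cvr_db(signal, consonant, vowel):
    """CVR in dB: rms of the consonant slice relative to rms of the vowel slice."""
    return 20.0 * np.log10(rms(signal[consonant]) / rms(signal[vowel]))

def enhance_cvr(signal, consonant, gain_db):
    """Return a copy of the signal with the consonant slice amplified by gain_db."""
    out = np.array(signal, dtype=float)
    out[consonant] *= 10.0 ** (gain_db / 20.0)
    return out

# Toy example: a weak noise-like "consonant" followed by a stronger sinusoidal "vowel".
fs = 10000
rng = np.random.default_rng(0)
consonant = slice(0, 500)
vowel = slice(500, 2500)
x = np.zeros(2500)
x[consonant] = 0.05 * rng.standard_normal(500)
x[vowel] = 0.5 * np.sin(2 * np.pi * 200 * np.arange(2000) / fs)

cvr_before = cvr_db(x, consonant, vowel)
y = enhance_cvr(x, consonant, gain_db=6.0)
cvr_after = cvr_db(y, consonant, vowel)
print(f"CVR before: {cvr_before:.1f} dB, after: {cvr_after:.1f} dB")
```

A practical system would probably smooth the gain near the segment edges to limit the processing artifacts listed under "Difficulties".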


Earlier studies on CVR enhancement
• House et al. (1965): MRT, high scores for high consonant level
• Gordon-Salant (1986): CVR +10 dB, 19 CV, elderly SNHI, +16%
• Guelke (1987): Burst intensity +17 dB, stop CV, NH, +40%
• Montgomery et al. (1987): CVR -20 dB to +9 dB, CVC, NH, SNHI, no significant loudness increase
• Freyman & Nerbonne (1989): Equated consonant levels across talkers, CV syllables, NH, +12%
• Thomas & Pandey (1996): CVR +3 to +12 dB, CV & VC, NH, +16%
• Kennedy et al. (1997): CE 0-24 dB, VC, SNHI, max CE: 8.3 dB (voiced), 10.7 dB (unvoiced)
• Hazan & Simpson (1998): Burst +12 dB, fric. +6 dB, nas. +6 dB filtering, VCV, SUS, NH, +12%


Earlier studies on duration enhancement
• Gordon-Salant (1986): DUR +100%, marginal improvement
• Thomas & Pandey (1996): BD +100%, FTD +50%, VOT +100%; BD, FTD → improved scores, VOT → degraded
• Vaughan et al. (2002): Unvoiced consonants expanded by 1.2, 1.4; 1.4 effective in noisy condition
• Nejime & Moore (1998): Voiced segments expanded by 1.2, 1.5; degraded performance
• Liu & Zeng (2006): Temporal envelope (2-50 Hz) contributes at positive SNRs; fine structure (> 500 Hz) contributes at lower SNRs
• Hodoshima et al. (2007): Slowed down, steady-state suppressed speech more intelligible in reverberant environments


AUTOMATED DETECTION OF TRANSITION SEGMENTS


Identifying regions for enhancement - segmentation / landmark detection

Manual segmentation
• accurate
• high detection rate
• time consuming
• subjective
• useful only for research & not for actual application

Automated detection of segments
• low detection rate
• less accurate
• consistent

Segmentation based on Spectral Transition Measures
• maximum spectral transitions coincide with segment boundaries


Earlier studies on automated segmentation
• Mermelstein (1975): based on loudness variation, low detection rate, slow carefully uttered speech
• Glass & Zue (1988): based on auditory critical bands, detection rate 90%, ± 20 ms
• Sarkar & Sreenivas (2005): based on level crossing rate, adaptive level allocation, detection rate 78.6%, ± 20 ms
• Alani & Deriche (1999): wavelet transform based, energy in different bands, detection rate 90.9%, ± 20 ms
• Liu (1996): landmark detection algorithm, energy variation in spectral bands, detection rate 83%, ± 20 ms


Earlier studies on automated intelligibility enhancement

Colotte & Laprie (2000)
• Segmentation by spectral variation function (82%)
• Stops and unvoiced fricatives amplified by +4 dB
• Time-scaled by 1.8, 2.0 (TD-PSOLA)
• Missing word identification, TIMIT sentences
• Improved performance

Skowronski & Harris (2006)
• Spectral transition measure based voiced/unvoiced classification
• Energy redistribution in voiced / unvoiced segments (ERVU)
• Amplifying low energy temporal regions critical to intelligibility
• Confusable words, TI-46 corpus, 16 talkers, 25 subjects
• Improved performance for 9 talkers, no degradation for others
• Enhancement useful for native & non-native listeners


PROPOSED METHOD FOR INTELLIGIBILITY ENHANCEMENT

[Block diagram. Blocks: landmark detection; HNM based analysis/modification/resynthesis; intensity scaling. Signals: speech signal, segment boundaries, time-scaling factors, time-scaled speech, intensity scaling factors, modified speech]

• VC and CV transition segments expanded, steady-state segments compressed, overall speech duration kept unaltered
• Intensity scaling of transition segments (CVR enhancement)

Objective: reducing the masking of consonantal segments by vowel segments


Liu’s landmark detection algorithm
▪ Based on energy variation in 6 spectral bands
▪ Segment duration, articulatory, and phonetic class constraints
▪ Glottal, sonorant closures, releases, stop closures, releases
▪ Peak picking based on convex-hull algorithm
▪ Matching of peaks across bands for locating boundaries
▪ Detection rate 83%, accuracy ± 20 ms

Observations
• Assumptions in the method: spectral prominence represented by peak energy in the band; one spectral prominence per band
• Information regarding frequency location of peak energy not used


Landmark detection using spectral peaks and centroids

• Spectrum divided into five non-overlapping bands: 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
• Spectral peak and centroid estimated in each band & used for calculating transition index

Peak energy

$$E_p(b,n) = 10 \log_{10} \max_{k_1 \le k \le k_2} |X_n(k)|^2$$

Centroid frequency

$$f_c(b,n) = \frac{\sum_{k=k_1}^{k_2} k\,|X_n(k)|^2}{\sum_{k=k_1}^{k_2} |X_n(k)|^2} \cdot \frac{f_s}{N}$$

Rate-of-rise functions

$$E'_p(b,n) = E_p(b,n+K) - E_p(b,n-K)$$

$$f'_c(b,n) = f_c(b,n+K) - f_c(b,n-K)$$

Transition index

$$T_r(n) = \sum_{b=1}^{5} \left| E'_p(b,n)\, f'_c(b,n) \right|$$

($k_1$, $k_2$: DFT indices at the edges of band $b$; $N$: DFT size; $f_s$: sampling frequency; $n$: frame index)
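A minimal NumPy sketch of the transition index: per-band peak energy (dB) and centroid frequency, their rate-of-rise over ±K frames, and the sum over the five bands of the magnitude of the product. The frame length, hop, window, value of K, and the use of the absolute value are assumptions made for illustration; the slide does not state them. The sampling rate is assumed to be at least 10 kHz so that every band contains DFT bins.

```python
import numpy as np

BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_peak_and_centroid(x, fs, n_fft=512, hop=10, bands=BANDS_HZ):
    """Return peak energy E_p (dB) and centroid f_c (Hz), each of shape (bands, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    e_p = np.zeros((len(bands), n_frames))
    f_c = np.zeros((len(bands), n_frames))
    for n in range(n_frames):
        frame = x[n * hop : n * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12  # small floor avoids log(0)
        for b, (lo, hi) in enumerate(bands):
            k = (freqs >= lo) & (freqs < hi)
            e_p[b, n] = 10.0 * np.log10(power[k].max())
            f_c[b, n] = np.sum(freqs[k] * power[k]) / np.sum(power[k])
    return e_p, f_c

def rate_of_rise(v, K):
    """v(b, n+K) - v(b, n-K); frames closer than K to either end are left at zero."""
    ror = np.zeros_like(v)
    ror[:, K:-K] = v[:, 2 * K:] - v[:, :-2 * K]
    return ror

def transition_index(x, fs, K=5, **kw):
    """Sum over bands of |peak-energy ROR * centroid ROR| for every frame."""
    e_p, f_c = band_peak_and_centroid(x, fs, **kw)
    return np.sum(np.abs(rate_of_rise(e_p, K) * rate_of_rise(f_c, K)), axis=0)

# Toy signal: a 600 Hz tone switching to a weaker 1000 Hz tone at sample 5000.
# Both tones repeat every 100 samples (= 2*K*hop), so frames compared by the
# rate-of-rise are identical within each steady-state half.
fs = 10000
t = np.arange(5000) / fs
x_demo = np.concatenate([np.sin(2 * np.pi * 600 * t), 0.1 * np.sin(2 * np.pi * 1000 * t)])
t_index = transition_index(x_demo, fs)
print("frame with largest transition index:", int(np.argmax(t_index)))
```

In this toy case the index stays near zero in the two steady-state halves and rises only for frames whose analysis window straddles the switch.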


Spectral peak & centroid variation in bands

Example: /aka/

[Figure: spectral peak and centroid contours in the five bands: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]

• Centroid variation not necessarily in phase with energy variation
• Transitions: some of the energy peaks and centroids undergo change


Peak & centroid ROR contours

Example: /aba/

[Figure: peak and centroid ROR contours in the five bands: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]

Observation: the product of the two RORs is near zero during steady states & peaks during transition segments


Detection of transition segments

[Figure (slide label: /aba/): (a) Signal waveform for VCV syllable /aka/, (b) spectrogram, (c) transition index, (d) transition boundaries detected]
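The slides do not state how boundaries are derived from the transition index. One plausible approach, sketched here purely as an assumption, is to keep local maxima that exceed a relative threshold and are separated by a minimum number of frames. The function name `pick_boundaries` and its parameters `rel_threshold` and `min_gap` are hypothetical.

```python
import numpy as np

def pick_boundaries(t_index, rel_threshold=0.2, min_gap=10):
    """Return frame indices of local maxima of t_index that exceed
    rel_threshold * max(t_index) and are at least min_gap frames apart."""
    t = np.asarray(t_index, dtype=float)
    if t.size < 3 or t.max() <= 0:
        return []
    thr = rel_threshold * t.max()
    # Local maxima: greater than the left neighbour, not less than the right one.
    candidates = [n for n in range(1, len(t) - 1)
                  if t[n] >= thr and t[n] > t[n - 1] and t[n] >= t[n + 1]]
    # Visit the strongest candidates first, dropping any that crowd a kept peak.
    kept = []
    for n in sorted(candidates, key=lambda i: t[i], reverse=True):
        if all(abs(n - m) >= min_gap for m in kept):
            kept.append(n)
    return sorted(kept)

# Toy transition index with two bumps, e.g. the VC and CV transitions of a VCV syllable.
frames = np.arange(200)
toy_index = (np.exp(-0.5 * ((frames - 60) / 4.0) ** 2)
             + 0.8 * np.exp(-0.5 * ((frames - 130) / 4.0) ** 2))
boundaries = pick_boundaries(toy_index)
print(boundaries)  # [60, 130]
```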


Evaluation using sentences

[Figure: (a) sentence ‘put the butcher block table’, (b) TIMIT landmarks, and (c) detected landmarks. Manual annotation: “bcl” - /b/ closure onset, “b” - /b/ release burst, etc. Automatic detection: landmarks numbered as 5, 6, etc.]


Evaluation using sentences
• 50 manually annotated sentences from TIMIT database
• 5 speakers: 3 female, 2 male

Detection rates

[Figure: "Detection Rates for TIMIT Sentences" - bar chart of detection (%), 0 to 100, versus landmark type (ST, FR, NAS, V, SV); legend: 30 ms, 20 ms, 10 ms]

ST - stop, FR - fricative, NAS - nasal, V - vowel, SV - semivowel
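A sketch of how a detection rate at a given temporal tolerance could be computed. It assumes a reference landmark counts as detected when an unused detected landmark lies within ± the tolerance; this matching rule and the landmark times below are hypothetical and are not taken from the paper or from TIMIT.

```python
def detection_rate(reference_ms, detected_ms, tolerance_ms):
    """Percentage of reference landmarks that have a detected landmark within
    +/- tolerance_ms. Each detected landmark is matched at most once."""
    detected = sorted(detected_ms)
    used = [False] * len(detected)
    hits = 0
    for ref in sorted(reference_ms):
        best, best_err = None, None
        for i, det in enumerate(detected):
            err = abs(det - ref)
            if not used[i] and err <= tolerance_ms and (best is None or err < best_err):
                best, best_err = i, err
        if best is not None:
            used[best] = True
            hits += 1
    return 100.0 * hits / len(reference_ms) if reference_ms else 0.0

# Hypothetical landmark times in milliseconds.
reference = [100, 250, 400, 620]
detected = [108, 275, 395, 900]
rates = {tol: detection_rate(reference, detected, tol) for tol in (10, 20, 30)}
print(rates)  # {10: 50.0, 20: 50.0, 30: 75.0}
```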


Harmonic plus noise model (HNM) (Stylianou 1996)

• Harmonic part / deterministic part (quasi-periodic components of speech): modeled by harmonics of fundamental frequency
• Noise part / stochastic part (non-periodic components): modeled by LPC coefficients, energy envelope

$$s_h(n) = \sum_{k=-L(n_a)}^{L(n_a)} A_k(n_a) \exp\left( j 2\pi k f_0(n_a) \frac{n - n_a}{f_s} \right)$$

$$s(n) = s_h(n) + s_n(n)$$
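A minimal sketch of synthesizing the harmonic part around one analysis instant. It assumes the complex amplitudes satisfy A_{-k} = conj(A_k) and A_0 = 0, so the two-sided sum over harmonics is real; the parameter values are made up for illustration and the noise part is not modeled here.

```python
import numpy as np

def synthesize_harmonic_frame(amps, phases, f0, fs, n, n_a):
    """Harmonic part s_h(n) around analysis instant n_a.
    amps[k-1] and phases[k-1] are the magnitude and phase of A_k for k = 1..L.
    A_{-k} is taken as conj(A_k) and A_0 as 0 (assumption), so the sum is real."""
    L = len(amps)
    a_pos = np.asarray(amps) * np.exp(1j * np.asarray(phases))  # A_1 .. A_L
    k = np.arange(1, L + 1)[:, None]
    arg = 2j * np.pi * k * f0 * (np.asarray(n)[None, :] - n_a) / fs
    positive = (a_pos[:, None] * np.exp(arg)).sum(axis=0)            # k = 1..L
    negative = (np.conj(a_pos)[:, None] * np.exp(-arg)).sum(axis=0)  # k = -1..-L
    return positive + negative  # complex array with negligible imaginary part

# Illustrative parameters: three harmonics of a 120 Hz pitch at fs = 10 kHz.
f0, fs, n_a = 120.0, 10000.0, 100
amps = [0.5, 0.3, 0.1]
phases = [0.0, 0.4, -1.0]
n = np.arange(200)
s_h = synthesize_harmonic_frame(amps, phases, f0, fs, n, n_a)

# Equivalent cosine form: 2 * sum_k |A_k| cos(2*pi*k*f0*(n - n_a)/fs + phase_k).
k = np.arange(1, 4)[:, None]
cosine_form = (2 * np.array(amps)[:, None]
               * np.cos(2 * np.pi * k * f0 * (n[None, :] - n_a) / fs
                        + np.array(phases)[:, None])).sum(axis=0)
print(np.max(np.abs(s_h.imag)), np.max(np.abs(s_h.real - cosine_form)))
```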


HNM parameters (Lehana and Pandey)

Voiced / unvoiced classification (V/UV)

Harmonic part
• Pitch F0
• Maximum voiced frequency Fm
• Amplitudes and phases of harmonics Ak

Noise part
• LPC coefficients
• Energy envelope

Voiced frame → parameters (harmonic part + noise part)
Unvoiced frame → parameters (noise part)


HNM based analysis stage
• Modification using a small parameter set
• Low perceptual distortions, preserves naturalness and intelligibility

[Block diagram: HNM analysis stage. Input: s(n). Blocks: pitch estimator, voicing detector, max. voiced freq. est., harm. amp. & phase est., harmonic part synth., high pass filter, analysis of noise part (LPC model, energy env. detector). Signals and labels: ta, Fm, V/UV, sh(n), sn(n), summing node (+, -). Outputs: harmonic part parameters; noise part parameters (LPC coeffs., energy)]


HNM based time-scale modification stage

[Equation: scaling factors]

[Block diagram: HNM based time-scale modification. Blocks: HNM params., time warping (β), harmonic part params., harmonic part synth., noise part params. (LPC coeffs., energy), random noise, multiplier, all-pole filter, high pass filter (Fm), adder. Signals: sh(n), noise part sn(n), output s(n)]
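The proposed method expands transition segments and compresses steady-state segments so that the overall duration is unaltered. One simple way to satisfy that constraint is sketched below: given a transition expansion factor, solve for the steady-state factor that keeps the total duration fixed. This is an illustrative derivation under that assumption, not necessarily the formula shown on the slide; the names `steady_state_factor`, `beta_tr`, and `beta_ss` are hypothetical.

```python
def steady_state_factor(transition_dur, steady_dur, beta_tr):
    """Scaling factor beta_ss for steady-state segments such that
    beta_tr * transition_dur + beta_ss * steady_dur == transition_dur + steady_dur."""
    beta_ss = 1.0 - (beta_tr - 1.0) * transition_dur / steady_dur
    if beta_ss <= 0.0:
        raise ValueError("steady-state segments too short to absorb the expansion")
    return beta_ss

# Hypothetical VCV syllable: two 40 ms transitions (80 ms) and 320 ms of steady state.
beta_ss = steady_state_factor(transition_dur=80.0, steady_dur=320.0, beta_tr=1.5)
print(beta_ss)  # 0.875
```

With these numbers the modified duration is 1.5 × 80 + 0.875 × 320 = 400 ms, the same as the original 80 + 320 ms.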


Example: VCV syllable /aba/

Time scaling of consonant duration with steady-state compression

[Examples for /aba/: original, synthesized (Syn.), and time-scaled (Tsm. = 1.5, Tsm. = 2, Tsm. = 3) versions; SNR: orig., +6 dB, +3 dB, 0 dB, -2 dB, -4 dB, -6 dB]


Spectrograms: Time-scaled VCV syllable /ama/

[Figure: spectrograms of the original (Orig.), synthesized (Synth.), and time-scaled (β = 1.5, β = 2, β = 2.5) syllable, showing steady-state compression and transition segment expansion]


Time and Intensity scaling: VCV syllable /aba/

[Figure: original; time-scaled by 1.5; time-scaled by 1.5 and intensity enhanced by +6 dB]


EXPERIMENTAL RESULTS

Test material - VCV syllables /aba/, /ada/, /aga/, /apa/, /ata/, /aka/

Time scaling factors: 1.0, 1.2, 1.5, 1.8, 2.0

CVR enhancement: +6 dB

12 processing conditions
• Unprocessed: UP
• Enhanced CVR without time-scaling: E
• Time scaled: TS-1.0, TS-1.2, TS-1.5, TS-1.8, TS-2.0
• Enhanced CVR, time scaled: ETS-1.0, ETS-1.2, ETS-1.5, ETS-1.8, ETS-2.0

Simulated hearing impairment (adding broadband noise)

6 different SNR levels (inf, 0, -3, -6, -9, and -12 dB)

72 test conditions

60 presentations, 5 tests for each condition, 1 subject
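The noise mixing and the count of test conditions can be sketched as follows. White Gaussian noise stands in for the broadband noise, and the SNR is defined over the whole stimulus; both are assumptions, since the slide does not specify the noise type or how the SNR was measured.

```python
import itertools
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so that 10*log10(P_signal / P_noise) = snr_db.
    snr_db = inf returns an unmodified copy."""
    signal = np.asarray(signal, dtype=float)
    if np.isinf(snr_db):
        return signal.copy()
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + noise

# 12 processing conditions x 6 SNR levels = 72 test conditions.
processing = ["UP", "E"] + [f"{p}-{f}"
                            for f in ("1.0", "1.2", "1.5", "1.8", "2.0")
                            for p in ("TS", "ETS")]
snr_levels_db = [float("inf"), 0, -3, -6, -9, -12]
conditions = list(itertools.product(processing, snr_levels_db))
print(len(processing), len(conditions))  # 12 72

# Toy stimulus mixed at -6 dB SNR.
rng = np.random.default_rng(1)
stimulus = np.sin(2 * np.pi * np.arange(1000) / 50)
noisy = add_noise_at_snr(stimulus, -6, rng)
```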


Results
• Time-scaling factors of 1.2-1.5 appear to be optimum
• Time-scaling improves performance at lower SNR levels
• Consonant intensity enhancement more effective

[Figure: "Recognition scores at different SNR levels" - recognition scores (%), 0 to 100, versus SNR (dB): inf., 0, -6, -12; legend: UP, E, TS-1.0, ETS-1.0, TS-1.2, ETS-1.2, TS-1.5, ETS-1.5, TS-1.8, ETS-1.8, TS-2.0, ETS-2.0]


SUMMARY & CONCLUSION
• Processing improved recognition scores for stop consonants, without increasing overall speech duration
• Method found more effective at lower SNR levels
• Place feature identification improved significantly by processing
• Intensity enhancement found more effective than duration enhancement

To be investigated
• Optimum scaling factors for different speech material
• Testing using different speech material
• Testing on a larger number of subjects & subjects with sensorineural impairment
• Analysis in terms of vowel context, consonant category
• Quantitative analysis of intelligibility enhancement - MRT