A Recognition Model for Speech Coding Wendy Holmes 20/20 Speech Limited, UK A DERA/NXT Joint Venture


A Recognition Model for Speech Coding

Wendy Holmes

20/20 Speech Limited, UK

A DERA/NXT Joint Venture


Introduction

• Speech coding at low data rates (a few hundred bits/s) requires a compact, low-dimensional representation => code variable-length speech “segments”.

• Automatic speech recognition is potentially a powerful way to identify useful segments for coding.

• BUT: HMM-based coding has limitations:

– shortcomings of HMMs as production models

– typical recognition feature sets (e.g. cepstral coefficients) impose limits on coded speech quality

– difficult to retain speaker characteristics (at least for speaker-independent recognition).


A “unified” model for speech coding


A simple coding scheme

• Demonstrate principles of coding using the same model for both recognition and synthesis.

• Model represents linear formant trajectories.

• Recognition: linear trajectory segmental HMMs of formant features.

• Synthesis: JSRU parallel-formant synthesizer.

• Coding is applied to analysed formant trajectories => relatively high bit-rate (typically 600-800 bits/s).

• Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.


Segment coding scheme overview


Formant analyser (EUROSPEECH’97)

– Each formant frequency estimate is assigned a value representing confidence in its measurement accuracy. When formants are indistinct, confidence is low.

– In cases of ambiguity, the analyser offers two alternative sets of formant trajectories for resolution in the recognition process.

[Figure: example utterance “four seven”]


Linear formant trajectory recognition

• Feature set: formant frequencies plus mel-cepstrum coefficients and an overall energy feature.

• Confidences: represent as variances: low confidence => large variance. Add confidence variance to model variance, so low-confidence features have little influence.

• Formant alternatives: choose one giving highest probability for each possible data segment and model state.

• Numbers of segments depend on phone identity: e.g. 1 segment for fricatives; 3 for voiceless stops.

• Range of durations: segment-dependent minimum and maximum segment duration.
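The confidence handling and the choice between formant alternatives can be illustrated with a small sketch. This is a simplified, single-frame diagonal Gaussian, not the linear-trajectory segmental HMM used in the study; the function names and the example numbers are hypothetical.

```python
import math


def log_likelihood(obs, model_mean, model_var, conf_var):
    """Log density of a diagonal Gaussian whose per-feature variance is
    model variance + confidence variance (assumed simplification)."""
    total = 0.0
    for x, mu, mv, cv in zip(obs, model_mean, model_var, conf_var):
        var = mv + cv  # low confidence => large cv => flatter density
        total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def pick_alternative(alternatives, model_mean, model_var):
    """alternatives: list of (obs, conf_var) pairs.
    Returns the index of the alternative with the highest log-likelihood,
    together with all the scores."""
    scores = [log_likelihood(o, model_mean, model_var, c) for o, c in alternatives]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical numbers: one formant feature, model mean 500 Hz, variance 2500.
mean, var = [500.0], [2500.0]

# Penalty for an observation 300 Hz away from the mean, with and without
# a large confidence variance attached to that observation.
confident_gap = log_likelihood([500.0], mean, var, [0.0]) - log_likelihood([800.0], mean, var, [0.0])
unsure_gap = log_likelihood([500.0], mean, var, [1e6]) - log_likelihood([800.0], mean, var, [1e6])

# Two alternative formant values offered for the same frame.
best, _ = pick_alternative([([800.0], [0.0]), ([520.0], [0.0])], mean, var)
```

With the large confidence variance, the mismatch costs far less, which is the sense in which low-confidence features have little influence.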


Frame-by-frame synthesizer controls

Values for each of 10 synthesizer control parameters are obtained at 10 ms intervals:

– Voicing and fundamental frequency from excitation analysis program.

– 3 formant frequency controls from formant analyser.

– 5 formant amplitude controls from FFT-based method.

• With 6 bits assigned to each of the 10 controls, the baseline data rate is 6000 bits/s.
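The baseline figure follows directly from the frame rate and the bit allocation. A short sketch of the arithmetic, with a hypothetical helper name, also shows how it compares with the 600-800 bits/s quoted earlier for segment coding:

```python
def frame_data_rate(num_controls, bits_per_control, frame_interval_ms):
    """Bits per second when every control is sent once per frame."""
    frames_per_second = 1000 / frame_interval_ms
    return num_controls * bits_per_control * frames_per_second


# 10 controls, 6 bits each, one frame every 10 ms (100 frames/s).
baseline = frame_data_rate(10, 6, 10)

# Reduction relative to the segment-coded rates mentioned in the slides.
reduction_at_600 = baseline / 600
reduction_at_800 = baseline / 800
```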


Segment coding

– Segments identified by recognizer are coded using straight-line fits to observed formant parameters.

– Use a least mean square error criterion. For formant frequencies, each frame's error is weighted according to its confidence variance, so the more reliable frames have more influence.

– To code a segment, represent value at start, and difference of end value from start value.

– Force continuity across segment boundaries where smooth changes are required for naturalness (e.g. semivowel-vowel boundaries).

– When there are formant alternatives, use those selected by recognizer.
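A minimal sketch of the straight-line segment fit, assuming that each frame is weighted by the inverse of its confidence variance (the slides do not state the exact weighting) and that the segment is stored as a start value plus an end-minus-start difference. The function names and the variance floor are hypothetical, and quantization is left out.

```python
def fit_segment(values, conf_vars, var_floor=1.0):
    """Weighted least-squares line through one parameter over one segment.
    Returns (start, delta), where delta is the end value minus the start value.
    Weights are 1 / confidence variance (assumed), floored to avoid division by zero."""
    n = len(values)
    if n == 0 or n != len(conf_vars):
        raise ValueError("values and conf_vars must be non-empty and equal length")
    if n == 1:
        return values[0], 0.0
    weights = [1.0 / max(v, var_floor) for v in conf_vars]
    sw = sum(weights)
    t_mean = sum(w * t for t, w in enumerate(weights)) / sw
    y_mean = sum(w * y for w, y in zip(weights, values)) / sw
    stt = sum(w * (t - t_mean) ** 2 for t, w in enumerate(weights))
    sty = sum(w * (t - t_mean) * (y - y_mean)
              for t, (w, y) in enumerate(zip(weights, values)))
    slope = sty / stt
    start = y_mean - slope * t_mean
    return start, slope * (n - 1)


def decode_segment(start, delta, n):
    """Rebuild n frame values on the straight line from start to start + delta."""
    if n == 1:
        return [start]
    return [start + delta * t / (n - 1) for t in range(n)]


# Hypothetical formant track (Hz) with one unreliable frame.
track = [500.0, 520.0, 900.0, 560.0]
conf = [1.0, 1.0, 1e9, 1.0]  # third frame has very low confidence
start, delta = fit_segment(track, conf)
rebuilt = decode_segment(start, delta, len(track))
```

Because the third frame carries almost no weight, the fitted line follows the three reliable frames and the outlier is smoothed over on reconstruction.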


Coding experiments

– Tested on 2 tasks: speaker-independent connected digit recognition and speaker-dependent recognition of airborne reconnaissance reports (500-word vocabulary).

– Frame-by-frame analysis-synthesis (at 6000 bits/s) generally produced a close copy of original speech.

– Segment-coded versions preserved main characteristics.

– There were some instances of formant analysis errors.

– In some cases, using the recognizer to select between alternative formant trajectories improved segment coding quality.

– In general, coding still works well even if there are recognition errors, as main requirement is to identify suitable segments for linear trajectory coding.


Speech coding results

Coded at about 600 bits/s:

– Speaker 1: digits

– Speaker 2: digits

– Speaker 3: digits

– Speaker 1: ARM report

Natural:

– Speaker 1: digits

– Speaker 2: digits

– Speaker 3: digits

– Speaker 1: ARM report

Achievements of study: Established principle of using formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.