Using Motherese in Speech Recognition EE516 final project Steven Schimmel March 13, 2003


What is Motherese?

The way mothers talk to their children when they are young

example:

Why use Motherese in Speech Recognition?

The exaggerated motherese speech helps infants to distinguish better between phonetic categories

examples:

Why use Motherese in Speech Recognition?

Mothers provide a great variety in word pronunciation, simulating many different talkers

If an infant can benefit from this, can an ASR benefit too?

Presentation Outline

Data preparation

Building the recognizers

Training and testing

Performance

Distance measure

Formant analysis

Conclusions

Data preparation

Conversations between mothers and adults, and between mothers and infants

Keyword transcription

Keyword extraction

Screening

Dividing over training and test sets

Keywords

Bead Key Sheep

Pot Sock Top

Boot Shoe Spoon

Dividing data into equal sets

[Tables: rows for mothers A and B, and for sets 1 to 4; columns for keywords ‘bead’, ‘boot’, ‘sock’; cell contents not recoverable]
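One way to divide keyword tokens over equal sets can be sketched as follows. This is only an illustration, not the procedure used in the project: the `(mother, keyword, token_id)` tuple layout, the function name `divide_into_sets`, the round-robin strategy, and the toy token counts are all assumptions.

```python
from collections import defaultdict


def divide_into_sets(tokens, n_sets=4):
    """Deal tokens round-robin so that every set receives a near-equal
    share of each (mother, keyword) combination.

    tokens: list of (mother, keyword, token_id) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for mother, keyword, token_id in tokens:
        groups[(mother, keyword)].append(token_id)

    sets = [[] for _ in range(n_sets)]
    offset = 0  # carry the dealing position over between groups
    for (mother, keyword), ids in sorted(groups.items()):
        for k, token_id in enumerate(ids):
            sets[(offset + k) % n_sets].append((mother, keyword, token_id))
        offset += len(ids)
    return sets


# Toy data: 2 mothers x 3 keywords x 8 tokens each (made-up counts).
tokens = [(m, w, f"{m}_{w}_{i}")
          for m in "AB" for w in ("bead", "boot", "sock") for i in range(8)]
sets = divide_into_sets(tokens)
```

Carrying the offset from one group to the next keeps the overall set sizes within one token of each other, even when a group does not divide evenly by the number of sets.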

Building the recognizers

HTK toolkit

Isolated word recognizers

Speech coding: MFCC, filterbank with 26 channels, 13 coefficients every 10 ms

HMM prototype: left-to-right, 8 states, single Gaussians

Training and testing

Flat-start initialization: use global mean and variance of training data for all models, assign equal probability to all states

Repeated embedded training: use all training data to simultaneously update all models
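The flat-start idea can be sketched in a few lines of plain Python. This is a minimal illustration, not HTK's own code: the function name `flat_start`, the dictionary layout of a state, and the 0.5/0.5 stay-or-advance transitions (one possible reading of "equal probability") are assumptions, and the sketch treats all states as emitting.

```python
def flat_start(frames, n_states=8):
    """Give every state the global mean and variance of the training data
    and build a left-to-right transition matrix.

    frames: list of feature vectors (lists of floats) pooled over all
    training utterances.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]

    # Every state of every model starts from the same single Gaussian.
    states = [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]

    # Left-to-right topology: stay or advance, each with probability 0.5
    # (assumed); the last state only loops on itself in this sketch.
    trans = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        trans[i][i] = 0.5
        trans[i][i + 1] = 0.5
    trans[-1][-1] = 1.0
    return states, trans


frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]  # toy 2-D "features"
states, trans = flat_start(frames)
```

Because all models start out identical, it is the subsequent embedded re-estimation passes that make them differ from one another.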

Performance

Infant-directed recognizer worse than adult-directed recognizer on adult-directed speech

But ID recognizer better on AD material than AD recognizer on ID material

Recognition results (%), recognizer (rows) against data set (columns):

Recognizer   AD train   AD test   ID train   ID test
AD           98.75      94.58     76.25      85.42
ID           88.75      90.00     96.67      83.33

Distance measure

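Distance between HMMs λ0 and λ1

$$D(\lambda_0, \lambda_1) = \lim_{T \to \infty} \frac{1}{T} \left[ \log P(O_T \mid \lambda_0) - \log P(O_T \mid \lambda_1) \right]$$

O_T is an observed sequence of length T of feature vectors generated from λ0

Ergodicity

α-recursion to compute P(O_T|λ)

$$\alpha_j(t) = b_j(o_t) \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$

Joint probability b_j(o_t) causes underflow

Log probability:

$$\log \alpha_j(t) = \log \left[ b_j(o_t) \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] = \log b_j(o_t) + \log \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$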
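The distance measure and the log-domain α-recursion can be combined into a small Monte Carlo estimate: sample one long sequence from λ0, score it under both models, and divide the log-likelihood difference by T. The sketch below is an illustration under simplifying assumptions, not the project's code: it uses two toy ergodic HMMs with one-dimensional Gaussian outputs, a finite T in place of the limit, and hypothetical names (`log_likelihood`, `sample`, `distance`).

```python
import math
import random


def safe_log(p):
    return math.log(p) if p > 0 else -math.inf


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for log b_j(o_t)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum(values):
    """log(sum(exp(v))) computed without leaving the log domain."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(obs, hmm):
    """log P(O | lambda) via the alpha-recursion in the log domain."""
    pi, A, means, variances = hmm
    n = len(pi)
    log_alpha = [safe_log(pi[j]) + log_gauss(obs[0], means[j], variances[j])
                 for j in range(n)]
    for o in obs[1:]:
        log_alpha = [
            log_gauss(o, means[j], variances[j])
            + log_sum([log_alpha[i] + safe_log(A[i][j]) for i in range(n)])
            for j in range(n)
        ]
    return log_sum(log_alpha)


def sample(hmm, T, rng):
    """Generate T observations from the HMM."""
    pi, A, means, variances = hmm
    states = list(range(len(pi)))
    s = rng.choices(states, weights=pi)[0]
    obs = []
    for _ in range(T):
        obs.append(rng.gauss(means[s], math.sqrt(variances[s])))
        s = rng.choices(states, weights=A[s])[0]
    return obs


def distance(hmm0, hmm1, T=2000, seed=0):
    """Finite-T estimate of D(lambda0, lambda1)."""
    obs = sample(hmm0, T, random.Random(seed))
    return (log_likelihood(obs, hmm0) - log_likelihood(obs, hmm1)) / T


# Toy models: (initial probs, transition matrix, state means, state variances).
hmm0 = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [0.0, 3.0], [1.0, 1.0])
hmm1 = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [0.5, 3.5], [1.0, 1.0])
d01 = distance(hmm0, hmm1)
d00 = distance(hmm0, hmm0)
```

Note that the measure is not symmetric: `distance(hmm0, hmm1)` and `distance(hmm1, hmm0)` generally differ, because the observations are drawn from the first model only.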

Log of summation

$$\log(a + b) = \log(b) + \log\left(1 + e^{\log(a) - \log(b)}\right)$$

Recursively apply

Make sure that the assumption b>a holds at every step of the recursion
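The log-of-summation step can be sketched as a pairwise helper that is applied recursively over a list of log probabilities. The function names `log_add` and `log_sum` are illustrative. Swapping the arguments so that the larger term is factored out is one way to keep the b>a assumption true at every step, which keeps the exponent non-positive and avoids overflow.

```python
import math


def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b)."""
    if log_a > log_b:
        # Swap so that b >= a; the exponent below is then never positive.
        log_a, log_b = log_b, log_a
    if log_b == -math.inf:
        # Both terms are zero probabilities.
        return -math.inf
    return log_b + math.log1p(math.exp(log_a - log_b))


def log_sum(log_terms):
    """Recursively apply log_add to get log(sum of all terms)."""
    total = -math.inf  # log(0): the empty sum
    for term in log_terms:
        total = log_add(total, term)
    return total


small = log_add(-1000.0, -1001.0)  # exp(-1000) underflows to 0.0 directly
check = log_sum([math.log(0.1), math.log(0.2), math.log(0.3)])
```

The first usage line shows the benefit: `math.exp(-1000.0)` evaluates to 0.0 in double precision, so the sum could not be formed outside the log domain.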

Distances

ID      boot    pot     sheep   shoe    sock    spoon   top
bead    7.55    15.90   6.73    8.68    14.74   8.75    14.93
boot            8.84    7.40    8.24    8.26    5.38    8.34
pot                     14.05   15.30   6.13    9.27    3.21
sheep                           5.75    11.41   9.53    12.33
shoe                                    12.09   8.27    14.04
sock                                            8.79    5.78
spoon                                                   9.67

AD      boot    pot     sheep   shoe    sock    spoon   top
bead    9.42    17.91   9.82    11.74   16.44   11.87   16.90
boot            12.84   12.88   11.58   11.49   7.60    12.06
pot                     22.11   23.56   6.65    14.40   4.99
sheep                           9.26    20.41   16.23   20.23
shoe                                    19.62   13.02   21.78
sock                                            12.49   7.10
spoon                                                   14.20
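The two distance tables can be compared pair by pair with a short script. The sketch below copies the values from the tables above and assumes that the table labeled ID holds distances between the infant-directed models and the one labeled AD those between the adult-directed models; the variable and function names are illustrative.

```python
WORDS = ["bead", "boot", "pot", "sheep", "shoe", "sock", "spoon", "top"]

# Upper-triangular rows as listed in the tables: row i holds the
# distances from WORDS[i] to each of WORDS[i+1:].
ID_ROWS = [
    [7.55, 15.90, 6.73, 8.68, 14.74, 8.75, 14.93],
    [8.84, 7.40, 8.24, 8.26, 5.38, 8.34],
    [14.05, 15.30, 6.13, 9.27, 3.21],
    [5.75, 11.41, 9.53, 12.33],
    [12.09, 8.27, 14.04],
    [8.79, 5.78],
    [9.67],
]
AD_ROWS = [
    [9.42, 17.91, 9.82, 11.74, 16.44, 11.87, 16.90],
    [12.84, 12.88, 11.58, 11.49, 7.60, 12.06],
    [22.11, 23.56, 6.65, 14.40, 4.99],
    [9.26, 20.41, 16.23, 20.23],
    [19.62, 13.02, 21.78],
    [12.49, 7.10],
    [14.20],
]


def to_pairs(rows):
    """Map each (word, other_word) pair to its tabulated distance."""
    return {(WORDS[i], WORDS[i + 1 + k]): d
            for i, row in enumerate(rows) for k, d in enumerate(row)}


id_pairs = to_pairs(ID_ROWS)
ad_pairs = to_pairs(AD_ROWS)

closest_id = min(id_pairs, key=id_pairs.get)  # least separated word pair
closest_ad = min(ad_pairs, key=ad_pairs.get)
all_smaller = all(id_pairs[p] < ad_pairs[p] for p in id_pairs)
```

In these tables every ID distance is smaller than its AD counterpart, and 'pot' and 'top' form the closest pair in both.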

Formant analysis

Extract vowels from keywords using phone-based speech recognizer

Estimate frequencies of formant 1 and formant 2

Formant 1 vs. formant 2

[Figure: two panels, "infant-directed speech" and "adult-directed speech", for the vowels /iy/, /uw/, /oh/; x-axis: Formant 1 (Hz), 200 to 1400; y-axis: Formant 2 (Hz), 1000 to 5000]

Conclusions

In general not worthwhile to train an ASR system on motherese speech

More false positives due to reduced selectivity of HMMs

Motherese not required for coverage of all speech sounds; use multiple speakers

Acquiring natural motherese speech is more difficult due to infants’ presence
