Using Motherese in Speech Recognition EE516 final project Steven Schimmel March 13, 2003

Using Motherese in Speech Recognition

EE516 final project
Steven Schimmel
March 13, 2003

What is Motherese?

The way mothers talk to their young children

Why use Motherese in Speech Recognition?

Exaggerated motherese speech helps infants distinguish phonetic categories more easily

Why use Motherese in Speech Recognition?

Mothers provide a great variety in word pronunciation, simulating many different talkers

If an infant can benefit from this, can an automatic speech recognition (ASR) system benefit too?

Presentation Outline

Data preparation
Building the recognizers
Training and testing
Performance
Distance measure
Formant analysis
Conclusions

Data preparation

Conversations between mothers and adults, and between mothers and infants

Keyword transcription
Keyword extraction
Screening
Division into training and test sets

Keywords

Bead Key Sheep

Pot Sock Top

Boot Shoe Spoon

Dividing data into equal sets

mother   ‘bead’   ‘boot’   ‘sock’
A
B

set      ‘bead’   ‘boot’   ‘sock’
1
2
3
4
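The slide does not say how tokens were assigned to the four sets. One simple way to get equal sets, shown here purely as an illustration, is to deal the tokens of each keyword round-robin over the sets. The token tuples and the function name `split_into_sets` are hypothetical, not taken from the project.

```python
from collections import defaultdict


def split_into_sets(tokens, n_sets=4):
    """Deal (mother, keyword, token_id) tuples into n_sets, keyword by keyword."""
    by_keyword = defaultdict(list)
    for token in tokens:
        by_keyword[token[1]].append(token)

    sets = [[] for _ in range(n_sets)]
    for keyword in sorted(by_keyword):
        # Round-robin dealing keeps the per-keyword counts of any two sets
        # within one token of each other.
        for index, token in enumerate(by_keyword[keyword]):
            sets[index % n_sets].append(token)
    return sets


# Hypothetical inventory: 2 mothers x 3 keywords x 4 tokens each.
tokens = [(m, k, i) for m in "AB" for k in ("bead", "boot", "sock") for i in range(4)]
sets = split_into_sets(tokens)
```

With this toy inventory every set receives the same number of tokens for each keyword, which is the property the tables on the slide appear to check.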

Building the recognizers

HTK toolkit
Isolated word recognizers
Speech coding: MFCC, filterbank with 26 channels, 13 coefficients every 10 ms
HMM prototype: left-to-right, 8 states, single Gaussians
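A left-to-right prototype can be pictured through its transition matrix. The sketch below assumes that the 8 states include HTK's non-emitting entry and exit states (the slide does not say), that no state skips are allowed, and that `stay=0.6` is an arbitrary initial self-loop probability; training re-estimates these values anyway.

```python
def left_to_right_transitions(n_states=8, stay=0.6):
    """Transition matrix of a left-to-right HMM without skips.

    Assumption: state 0 is a non-emitting entry state and the last state
    is a non-emitting exit state, as in HTK model definitions.
    """
    a = [[0.0] * n_states for _ in range(n_states)]
    a[0][1] = 1.0                     # entry state always moves to the first emitting state
    for i in range(1, n_states - 1):
        a[i][i] = stay                # self-loop: remain in the same state
        a[i][i + 1] = 1.0 - stay      # advance to the next state
    return a                          # last row stays all zero (exit state)


prototype = left_to_right_transitions()
```

Every entry below the diagonal is zero, so a model can never move back to an earlier state.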

Training and testing

Flat-start initialization: use global mean and variance of training data for all models, assign equal probability to all states

Repeated embedded training: use all training data to simultaneously update all models
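The Gaussian part of flat-start initialization can be sketched as follows: pool every frame of the training data, compute one global mean and variance per feature dimension, and copy them into every state of every word model. HTK provides the HCompV tool for this step; the code below is only a plain-Python illustration, and the data layout and the names `global_mean_and_variance` and `flat_start` are assumptions.

```python
def global_mean_and_variance(utterances):
    """Per-dimension mean and variance pooled over all frames of all utterances."""
    frames = [frame for utterance in utterances for frame in utterance]
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(words, n_emitting_states, utterances):
    """Give every emitting state of every word model the same global Gaussian."""
    mean, var = global_mean_and_variance(utterances)
    return {
        word: [{"mean": list(mean), "var": list(var)} for _ in range(n_emitting_states)]
        for word in words
    }


# Toy data: two utterances with 2-dimensional feature vectors.
utterances = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
models = flat_start(["bead", "boot"], 6, utterances)
```

After this step all models are identical; the repeated embedded training passes are what make them diverge.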

Performance

Infant-directed (ID) recognizer worse than adult-directed (AD) recognizer on AD speech

But ID recognizer better on AD material than AD recognizer on ID material

Recognition accuracy (%); rows are recognizers, columns are data sets

Recognizer   ADtrain   ADtest   IDtrain   IDtest
AD           98.75     94.58    76.25     85.42
ID           88.75     90.00    96.67     83.33
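The two statements on this slide can be checked directly against the table. The sketch below assumes that rows are recognizers, columns are evaluation sets, and the values are percent-correct scores; the helper name `cross_condition_gap` is illustrative.

```python
# accuracy[recognizer][data set], copied from the table
accuracy = {
    "AD": {"ADtrain": 98.75, "ADtest": 94.58, "IDtrain": 76.25, "IDtest": 85.42},
    "ID": {"ADtrain": 88.75, "ADtest": 90.00, "IDtrain": 96.67, "IDtest": 83.33},
}


def cross_condition_gap(kind):
    """ID recognizer on AD data minus AD recognizer on ID data ('train' or 'test')."""
    return accuracy["ID"]["AD" + kind] - accuracy["AD"]["ID" + kind]


# The ID recognizer scores lower than the AD recognizer on AD test speech ...
id_worse_on_ad = accuracy["ID"]["ADtest"] < accuracy["AD"]["ADtest"]
# ... but it transfers to AD material better than the AD recognizer transfers to ID material.
gap_train = cross_condition_gap("train")
gap_test = cross_condition_gap("test")
```

Both gaps come out positive (about 12.5 points on the training sets and 4.58 points on the test sets), matching the second statement.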

Distance measure

Distance between HMMs λ0 and λ1

$$D(\lambda_0, \lambda_1) = \lim_{T \to \infty} \frac{1}{T} \left[ \log P(O_T \mid \lambda_0) - \log P(O_T \mid \lambda_1) \right]$$

O_T is an observed sequence of length T of feature vectors generated from λ0

Ergodicity
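In practice the limit is approximated with one long but finite sequence. The sketch below estimates the distance for two toy discrete-output HMMs: it samples O_T from λ0 and compares the log-likelihoods under both models. The toy models, the sequence length and the use of per-frame scaling in the forward pass are all assumptions made for illustration; the project itself used continuous-density HTK models.

```python
import math
import random


def sample(hmm, length, rng):
    """Draw an observation sequence from a discrete HMM (pi, a, b)."""
    pi, a, b = hmm
    state = rng.choices(range(len(pi)), weights=pi)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.choices(range(len(b[state])), weights=b[state])[0])
        state = rng.choices(range(len(a[state])), weights=a[state])[0]
    return obs


def log_likelihood(hmm, obs):
    """log P(O | hmm) via the forward algorithm with per-frame scaling."""
    pi, a, b = hmm
    n = len(pi)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [b[j][obs[t]] * sum(alpha[i] * a[i][j] for i in range(n))
                     for j in range(n)]
        scale = sum(alpha)              # P(o_t | o_1 .. o_{t-1})
        total += math.log(scale)
        alpha = [x / scale for x in alpha]
    return total


def distance(hmm0, hmm1, length=5000, seed=0):
    """Finite-T estimate of D(hmm0, hmm1) from one sequence generated by hmm0."""
    obs = sample(hmm0, length, random.Random(seed))
    return (log_likelihood(hmm0, obs) - log_likelihood(hmm1, obs)) / length


# Two ergodic toy models with two states and two output symbols.
hmm0 = ([0.5, 0.5], [[0.95, 0.05], [0.05, 0.95]], [[0.9, 0.1], [0.1, 0.9]])
hmm1 = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.4, 0.6]])
d01 = distance(hmm0, hmm1)
```

The measure is not symmetric in general: `distance(hmm0, hmm1)` and `distance(hmm1, hmm0)` usually differ.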

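α-recursion to compute P(O_T|λ)

$$\alpha_j(t) = b_j(o_t) \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$

Joint probability b_j(o_t) causes underflow

Log probability:

$$\log \alpha_j(t) = \log\left[ b_j(o_t) \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] = \log b_j(o_t) + \log \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$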
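The underflow problem is easy to reproduce. The sketch below runs the α-recursion on a toy two-state discrete HMM, once with plain probabilities and once with log probabilities. The model parameters are invented for the demonstration, and the helper `log_sum_exp` uses the common shift-by-the-maximum form of the log of a sum rather than a pairwise recursion.

```python
import math


def log_sum_exp(values):
    """log(sum(exp(v))) computed without leaving the log domain."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_linear(pi, a, b, obs):
    """P(O | model) with the plain alpha-recursion (underflows for long O)."""
    n = len(pi)
    alpha = [pi[j] * b[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [b[j][o] * sum(alpha[i] * a[i][j] for i in range(n)) for j in range(n)]
    return sum(alpha)


def forward_log(pi, a, b, obs):
    """log P(O | model) with the alpha-recursion carried out on log values."""
    n = len(pi)
    log_alpha = [math.log(pi[j]) + math.log(b[j][obs[0]]) for j in range(n)]
    for o in obs[1:]:
        log_alpha = [
            math.log(b[j][o])
            + log_sum_exp([log_alpha[i] + math.log(a[i][j]) for i in range(n)])
            for j in range(n)
        ]
    return log_sum_exp(log_alpha)


# Toy model (all values are illustrative assumptions).
pi = [0.5, 0.5]
a = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]

short_obs = [0, 1, 0]
long_obs = [0, 1] * 10000              # 20000 frames

p_long = forward_linear(pi, a, b, long_obs)    # underflows: 0.0 or a subnormal value
logp_long = forward_log(pi, a, b, long_obs)    # large negative but finite
```

On the short sequence both versions agree; on the long one only the log-domain result is usable.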

Log of summation

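Recursively apply

$$\log(a + b) = \log(b) + \log\left(1 + e^{\log(a) - \log(b)}\right)$$

Make sure that the assumption b > a holds at every step of the recursion, so that the exponent stays negative and the exponential cannot overflow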
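A minimal sketch of the log-of-summation step: the two arguments are swapped when necessary so that the larger term is always the one factored out, and a sum of N terms is reduced by applying the pairwise rule repeatedly. The function names `log_add` and `log_sum` are illustrative, and the use of `math.log1p` is a choice made here for accuracy when the ratio is tiny.

```python
import math
from functools import reduce


def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b)."""
    if log_a > log_b:
        log_a, log_b = log_b, log_a           # enforce b >= a
    if log_a == -math.inf:
        return log_b                          # a is zero: nothing to add
    # The exponent log_a - log_b is <= 0, so exp() lies in (0, 1].
    return log_b + math.log1p(math.exp(log_a - log_b))


def log_sum(log_terms):
    """Log of a sum of N terms, applying log_add recursively."""
    return reduce(log_add, log_terms)


# Works even where exp(-1000) itself would underflow to zero.
tiny = log_add(-1000.0, -1001.0)
```

The swap is what enforces the ordering assumption at every step, regardless of the order in which the terms arrive.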

Distances

ID      boot    pot     sheep   shoe    sock    spoon   top
bead    7.55    15.90   6.73    8.68    14.74   8.75    14.93
boot            8.84    7.40    8.24    8.26    5.38    8.34
pot                     14.05   15.30   6.13    9.27    3.21
sheep                           5.75    11.41   9.53    12.33
shoe                                    12.09   8.27    14.04
sock                                            8.79    5.78
spoon                                                   9.67

AD      boot    pot     sheep   shoe    sock    spoon   top
bead    9.42    17.91   9.82    11.74   16.44   11.87   16.90
boot            12.84   12.88   11.58   11.49   7.60    12.06
pot                     22.11   23.56   6.65    14.40   4.99
sheep                           9.26    20.41   16.23   20.23
shoe                                    19.62   13.02   21.78
sock                                            12.49   7.10
spoon                                                   14.20
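The two triangular tables can be compared pair by pair. The sketch below copies the values from the slide and counts for how many keyword pairs the ID models are closer together than the AD models; the variable and function names are illustrative.

```python
KEYWORDS = ["bead", "boot", "pot", "sheep", "shoe", "sock", "spoon", "top"]

# Upper-triangular rows as printed on the slide (row i lists pairs with later keywords).
ID_ROWS = [
    [7.55, 15.90, 6.73, 8.68, 14.74, 8.75, 14.93],
    [8.84, 7.40, 8.24, 8.26, 5.38, 8.34],
    [14.05, 15.30, 6.13, 9.27, 3.21],
    [5.75, 11.41, 9.53, 12.33],
    [12.09, 8.27, 14.04],
    [8.79, 5.78],
    [9.67],
]
AD_ROWS = [
    [9.42, 17.91, 9.82, 11.74, 16.44, 11.87, 16.90],
    [12.84, 12.88, 11.58, 11.49, 7.60, 12.06],
    [22.11, 23.56, 6.65, 14.40, 4.99],
    [9.26, 20.41, 16.23, 20.23],
    [19.62, 13.02, 21.78],
    [12.49, 7.10],
    [14.20],
]


def to_pairs(rows):
    """Map (word, later word) -> distance for an upper-triangular table."""
    pairs = {}
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            pairs[(KEYWORDS[i], KEYWORDS[i + 1 + offset])] = value
    return pairs


id_pairs, ad_pairs = to_pairs(ID_ROWS), to_pairs(AD_ROWS)
id_closer = [pair for pair in id_pairs if id_pairs[pair] < ad_pairs[pair]]
closest_id = min(id_pairs, key=id_pairs.get)
closest_ad = min(ad_pairs, key=ad_pairs.get)
mean_id = sum(id_pairs.values()) / len(id_pairs)
mean_ad = sum(ad_pairs.values()) / len(ad_pairs)
```

For all 28 keyword pairs the ID distance is smaller than the AD distance, and in both tables the closest pair is 'pot' and 'top'.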

Formant analysis

Extract vowels from keywords using phone-based speech recognizer

Estimate frequencies of formant 1 and formant 2

Formant 1 vs. formant 2

[Figure: two panels, infant-directed speech and adult-directed speech, plotting formant 2 (Hz, 1000 to 5000) against formant 1 (Hz, 200 to 1400) for the vowels /iy/, /uw/ and /oh/]

Conclusions

In general not worthwhile to train an ASR system on motherese speech

More false positives due to reduced selectivity of the motherese-trained HMMs (smaller distances between word models)

Motherese not required for coverage of all speech sounds; use multiple speakers

Acquiring natural motherese speech is more difficult due to infants’ presence