Using Motherese in Speech Recognition
EE516 final project
Steven Schimmel
March 13, 2003
What is Motherese?
The way mothers talk to their children when they are young
example:
Why use Motherese in Speech Recognition?
The exaggerated motherese speech helps infants to distinguish better between phonetic categories
examples:
Why use Motherese in Speech Recognition?
Mothers provide a great variety in word pronunciation, simulating many different talkers
If an infant can benefit from this, can an ASR benefit too?
Presentation Outline
Data preparation
Building the recognizers
Training and testing
Performance
Distance measure
Formant analysis
Conclusions
Data preparation
Conversations between mothers and adults, and between mothers and infants
Keyword transcription
Keyword extraction
Screening
Dividing over training and test sets
Keywords
Bead Key Sheep
Pot Sock Top
Boot Shoe Spoon
Dividing data into equal sets
[Tables: one with rows for mothers A and B, one with rows for sets 1, 2, 3 and 4; columns ‘bead’, ‘boot’, ‘sock’; cell values not recoverable]
Building the recognizers
HTK toolkit
Isolated word recognizers
Speech coding: MFCC, filterbank with 26 channels, 13 coefficients every 10 ms
HMM prototype: left-to-right, 8 states, single Gaussians
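A minimal sketch of what such a prototype could look like, written in plain Python rather than as an HTK model definition. The self-loop probability of 0.6, the diagonal covariance, and the treatment of all 8 states as emitting are assumptions made for illustration; HTK model definitions also include non-emitting entry and exit states, and the slide does not say whether the 8 states count them.

```python
def left_to_right_transitions(num_states, self_loop=0.6):
    """Build a left-to-right transition matrix.

    Each state can only stay where it is or move to the next state.
    The last state is absorbing in this sketch.
    """
    if num_states < 1:
        raise ValueError("num_states must be positive")
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states - 1):
        matrix[i][i] = self_loop            # stay in state i
        matrix[i][i + 1] = 1.0 - self_loop  # advance to state i + 1
    matrix[num_states - 1][num_states - 1] = 1.0
    return matrix


def single_gaussian_states(num_states, num_coeffs):
    """One Gaussian per state: zero mean, unit variance (placeholder values)."""
    return [{"mean": [0.0] * num_coeffs, "var": [1.0] * num_coeffs}
            for _ in range(num_states)]


# Hypothetical prototype: 8 states, 13 coefficients per frame.
prototype = {
    "transitions": left_to_right_transitions(8),
    "states": single_gaussian_states(8, 13),
}
```

The zero entries below the diagonal are what make the model left-to-right: once a state has been left, it cannot be re-entered.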
Training and testing
Flat-start initialization: use global mean and variance of training data for all models, assign equal probability to all states
Repeated embedded training: use all training data to simultaneously update all models
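The flat-start step can be sketched as follows. This is an illustration in plain Python, not the HTK tooling used in the project; the function names and the toy two-dimensional frames are hypothetical, and only the mean and variance part of the initialization is covered.

```python
def global_mean_and_variance(utterances):
    """Pool all frames of all utterances; return per-dimension mean and variance."""
    frames = [frame for utterance in utterances for frame in utterance]
    if not frames:
        raise ValueError("no training frames")
    count = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / count for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / count for d in range(dim)]
    return mean, var


def flat_start(words, num_states, utterances):
    """Give every state of every word model the same global mean and variance."""
    mean, var = global_mean_and_variance(utterances)
    return {
        word: [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]
        for word in words
    }


# Toy data: two utterances with two 2-dimensional frames each (made up).
toy_utterances = [
    [[1.0, 2.0], [3.0, 2.0]],
    [[5.0, 2.0], [7.0, 6.0]],
]
models = flat_start(["bead", "boot"], num_states=8, utterances=toy_utterances)
```

After this step all models are identical; they only start to differ once embedded training updates them against the transcribed training data.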
Performance
Infant-directed recognizer worse than adult-directed recognizer on adult-directed speech
But ID recognizer better on AD material than AD recognizer on ID material
Recognizer   ADtrain   ADtest   IDtrain   IDtest
AD             98.75    94.58     76.25    85.42
ID             88.75    90.00     96.67    83.33
Distance measure
Distance between HMMs λ0 and λ1
O_T is an observed sequence of length T of feature vectors generated from λ0
Ergodicity
$$D(\lambda_0, \lambda_1) = \lim_{T \to \infty} \frac{1}{T} \left[ \log P(O_T \mid \lambda_0) - \log P(O_T \mid \lambda_1) \right]$$
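As a rough illustration, the limit can be approximated by sampling one long sequence from λ0 and scoring it under both models. The sketch below assumes small discrete-output HMMs stored as dictionaries, a sequence length of 2000 and a fixed random seed; all of these are hypothetical choices for the example, and the project itself used Gaussian HMMs built with HTK.

```python
import math
import random


def safe_log(p):
    """log(p), with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf


def log_sum_exp(values):
    """log(sum(exp(v))) computed without leaving the log domain."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_likelihood(obs, hmm):
    """log P(obs | hmm) via a log-domain forward pass."""
    n = len(hmm["start"])
    log_alpha = [safe_log(hmm["start"][j]) + safe_log(hmm["emit"][j][obs[0]])
                 for j in range(n)]
    for symbol in obs[1:]:
        log_alpha = [
            safe_log(hmm["emit"][j][symbol])
            + log_sum_exp([log_alpha[i] + safe_log(hmm["trans"][i][j])
                           for i in range(n)])
            for j in range(n)
        ]
    return log_sum_exp(log_alpha)


def sample(hmm, length, rng):
    """Draw an observation sequence of the given length from the HMM."""
    n = len(hmm["start"])
    state = rng.choices(range(n), weights=hmm["start"])[0]
    obs = []
    for _ in range(length):
        symbols = range(len(hmm["emit"][state]))
        obs.append(rng.choices(symbols, weights=hmm["emit"][state])[0])
        state = rng.choices(range(n), weights=hmm["trans"][state])[0]
    return obs


def hmm_distance(hmm0, hmm1, length=2000, seed=0):
    """Finite-T estimate of D(hmm0, hmm1) from one sequence sampled from hmm0."""
    obs = sample(hmm0, length, random.Random(seed))
    return (log_likelihood(obs, hmm0) - log_likelihood(obs, hmm1)) / length


# Two made-up models over a binary alphabet.
hmm_a = {"start": [0.5, 0.5],
         "trans": [[0.9, 0.1], [0.1, 0.9]],
         "emit": [[0.9, 0.1], [0.2, 0.8]]}
hmm_b = {"start": [0.5, 0.5],
         "trans": [[0.5, 0.5], [0.5, 0.5]],
         "emit": [[0.5, 0.5], [0.5, 0.5]]}

d_aa = hmm_distance(hmm_a, hmm_a)
d_ab = hmm_distance(hmm_a, hmm_b)
```

The distance of a model to itself is exactly zero, and the estimate grows as the second model explains sequences from the first one less well.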
α-recursion to compute P(O_T|λ)
$$\alpha_j(t) = b_j(o_t) \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$
Joint probability b_j(o_t) causes underflow
Log probability:
$$\log \alpha_j(t) = \log b_j(o_t) + \log \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}$$
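The underflow problem and the log-domain remedy can be shown with a small sketch. It assumes a two-state model, an initialization α_j(1) = π_j b_j(o_1), and 100 frames whose output probabilities are all 1e-5; these numbers are invented purely to trigger underflow and do not come from the project.

```python
import math


def safe_log(p):
    """log(p), with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf


def log_sum_exp(values):
    """log(sum(exp(v))), shifted by the maximum to stay in range."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_linear(start, trans, emit):
    """Plain alpha-recursion; emit[t][j] plays the role of b_j(o_t)."""
    n = len(start)
    alpha = [start[j] * emit[0][j] for j in range(n)]
    for b_t in emit[1:]:
        alpha = [b_t[j] * sum(alpha[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


def forward_log(start, trans, emit):
    """Same recursion carried out on log alpha; returns log P(O | model)."""
    n = len(start)
    log_alpha = [safe_log(start[j]) + safe_log(emit[0][j]) for j in range(n)]
    for b_t in emit[1:]:
        log_alpha = [
            safe_log(b_t[j])
            + log_sum_exp([log_alpha[i] + safe_log(trans[i][j]) for i in range(n)])
            for j in range(n)
        ]
    return log_sum_exp(log_alpha)


# Made-up two-state left-to-right model and 100 low-probability frames.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[1e-5, 1e-5] for _ in range(100)]

p_linear = forward_linear(start, trans, emit)  # underflows to 0.0
log_p = forward_log(start, trans, emit)        # stays finite
```

Here the true probability is 1e-500, which a double cannot represent, so the plain recursion returns zero while the log-domain version returns 100 times log(1e-5).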
Log of summation
Recursively apply
Make sure that the assumption b>a holds at every step of the recursion
$$\log(a + b) = \log(a) + \log\left(1 + e^{\{\log(b) - \log(a)\}}\right)$$
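A minimal sketch of applying the log-of-summation identity recursively to a list of log probabilities. The function names are hypothetical. One design choice here is an assumption of this sketch rather than something taken from the slide: the arguments are swapped so that the larger term is always factored out, which keeps the exponent at or below zero.

```python
import math


def log_add(log_x, log_y):
    """Return log(x + y) given log(x) and log(y)."""
    larger, smaller = (log_x, log_y) if log_x >= log_y else (log_y, log_x)
    if smaller == -math.inf:
        # Adding zero probability changes nothing (also covers both being zero).
        return larger
    # The exponent is <= 0, so exp() cannot overflow.
    return larger + math.log1p(math.exp(smaller - larger))


def log_sum(log_values):
    """Fold log_add over a sequence, starting from log(0)."""
    total = -math.inf
    for value in log_values:
        total = log_add(total, value)
    return total


# exp(-1000) underflows to 0.0, yet the sum of two such terms is still usable.
tiny = [-1000.0, -1000.0]
combined = log_sum(tiny)
```

In this example `combined` equals -1000 + log(2), a value that could not be obtained by exponentiating, adding, and taking the logarithm again.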
Distances
ID      boot    pot  sheep   shoe   sock  spoon    top
bead    7.55  15.90   6.73   8.68  14.74   8.75  14.93
boot           8.84   7.40   8.24   8.26   5.38   8.34
pot                  14.05  15.30   6.13   9.27   3.21
sheep                        5.75  11.41   9.53  12.33
shoe                               12.09   8.27  14.04
sock                                       8.79   5.78
spoon                                             9.67
AD      boot    pot  sheep   shoe   sock  spoon    top
bead    9.42  17.91   9.82  11.74  16.44  11.87  16.90
boot          12.84  12.88  11.58  11.49   7.60  12.06
pot                  22.11  23.56   6.65  14.40   4.99
sheep                        9.26  20.41  16.23  20.23
shoe                               19.62  13.02  21.78
sock                                      12.49   7.10
spoon                                            14.20
Formant analysis
Extract vowels from keywords using phone-based speech recognizer
Estimate frequencies of formant 1 and formant 2
Formant 1 vs. formant 2
[Figure: two scatter plots of formant 2 (Hz) against formant 1 (Hz) for the vowels /iy/, /uw/ and /oh/, one for infant-directed speech and one for adult-directed speech]
Conclusions
In general not worthwhile to train an ASR system on motherese speech
More false positives due to reduced selectivity of HMMs
Motherese not required for coverage of all speech sounds; use multiple speakers
Acquiring natural motherese speech is more difficult due to infants’ presence