EE E6820: Speech & Audio Processing & Recognition
Lecture 5: Speech modeling
Dan Ellis <[email protected]>, Michael Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 19, 2009
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
The speech signal
[Figure: spectrogram of "watch thin as a dime" with aligned phonetic transcription]
Elements of the speech signal:
- spectral resonances (formants, moving)
- periodic excitation (voicing, pitched) + pitch contour
- noise excitation
- transients (stop-release bursts)
- amplitude modulation (nasals, approximants)
- timing!
The source-filter model
Notional separation of
source: excitation, fine time-frequency structure
filter: resonance, broad spectral structure
[Block diagram: glottal pulse train (controlled by pitch and voiced/unvoiced decision) and frication noise form the source; vocal tract resonances (formants) plus the radiation characteristic form the filter; their cascade produces speech]
More a modeling approach than a single model
Signal modeling
Signal models are a kind of representation:
- to make some aspect explicit
- for efficiency
- for flexibility

Nature of the model depends on the goal:
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

But commonalities emerge:
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- the modification domain will usually reflect 'independent' perceptual attributes
- getting at the abstract information in the signal
Different influences for signal models
Receiver:
- see how the signal is treated by listeners
→ cochlea-style filterbank models . . .

Transmitter (source):
- the physical vocal apparatus can generate only a limited range of signals . . .
→ LPC models of vocal tract resonances

Making particular aspects explicit:
- compact, separable correlates of resonances → cepstrum
- modeling prominent features of the narrowband (NB) spectrogram → sinusoid models
- addressing unnaturalness in synthesis → harmonic+noise model
Application of (speech) signal models
Classification / matching (goal: highlight important information):
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

Coding / transmission / storage (goal: represent just enough information):
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

Modification / synthesis (goal: change certain parts independently):
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)
Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
Spectral and cepstral models
The spectrogram seems like a good representation:
- long history
- satisfying in use
- experts can 'read' the speech

What is the information?
- intensity in time-frequency cells
- typically 5 ms × 200 Hz × 50 dB

→ Discarded detail:
- phase
- fine-scale timing
The starting point for other representations
Short-time Fourier transform (STFT) as filterbank

View spectrogram rows as coming from separate bandpass filters applied to the sound.

Mathematically:

X[k, n_0] = \sum_n x[n]\, w[n - n_0]\, e^{-j 2\pi k (n - n_0)/N} = \sum_n x[n]\, h_k[n_0 - n]

where h_k[n] = w[-n]\, e^{j 2\pi k n / N}

[Figure: impulse response h_k[n] = w[-n] e^{j 2\pi k n/N} and its frequency response H_k(e^{j\omega}) = W(e^{j(\omega - 2\pi k/N)}), a bandpass response centered at 2\pi k/N]
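As a concrete illustration, here is a minimal numpy sketch of this analysis; the window, hop, and FFT sizes are illustrative choices, not values from the slide:

```python
import numpy as np

def stft(x, N=256, hop=64):
    """X[k, n0]: window the signal at each hop point and take the DFT.
    Row k is the (subsampled) output of a bandpass filter at 2*pi*k/N."""
    w = np.hanning(N)                            # analysis window w[n]
    n_frames = 1 + (len(x) - N) // hop
    X = np.empty((N // 2 + 1, n_frames), dtype=complex)
    for m in range(n_frames):
        X[:, m] = np.fft.rfft(x[m * hop : m * hop + N] * w)
    return X

# Example: 440 Hz tone at fs = 8 kHz; S is the dB-magnitude spectrogram
fs = 8000
x = np.cos(2 * np.pi * 440 * np.arange(fs) / fs)
S = 20 * np.log10(np.abs(stft(x)) + 1e-10)
```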
Spectral models: which bandpass filters?
Constant bandwidth? (analog / FFT)

But: cochlea physiology & critical bandwidths
→ implement ear models with bandpass filters & choose bandwidths by e.g. critical-band (CB) estimates

Auditory frequency scales:
- constant 'Q' (center frequency / bandwidth), mel, Bark, . . .
Gammatone filterbank

Given bandwidths, which filter shapes?
- match the inferred temporal integration window
- match the inferred spectral shape (sharp high-frequency slope)
- keep it simple (since it's only approximate)

→ Gammatone filters:
- 2N poles, 2 zeros, low complexity
- reasonable linear match to cochlea

h[n] = n^{N-1}\, e^{-bn} \cos(\omega_i n)

[Figure: gammatone impulse response; z-plane plot with four double poles and two zeros; filterbank magnitude responses in dB from 50 Hz to 5 kHz]
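A sketch of the impulse response above for a 4th-order filter. The bandwidth constants follow the commonly used ERB rule of thumb (Glasberg & Moore), which is an assumption here rather than something specified on the slide:

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, dur=0.05):
    """h[n] = n^(N-1) e^(-bn) cos(w_i n) for a filter centered at fc Hz."""
    n = np.arange(int(dur * fs), dtype=float)
    erb = 24.7 + fc / 9.265              # ERB bandwidth (Glasberg & Moore)
    b = 2 * np.pi * 1.019 * erb / fs     # decay rate per sample
    w_i = 2 * np.pi * fc / fs            # center frequency, radians/sample
    h = n ** (order - 1) * np.exp(-b * n) * np.cos(w_i * n)
    return h / np.max(np.abs(h))         # peak-normalize for plotting
```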
Constant-BW vs. cochlea model
[Figure: frequency responses and spectrograms. Left: gains in dB (0 to -50) of the effective FFT filterbank vs. the gammatone filterbank over 0-8000 Hz. Right: FFT-based wideband spectrogram (N = 128, linear frequency axis to 8 kHz) vs. a Q = 4, 4-pole 2-zero cochlea model downsampled at 64 (log frequency axis, 100 Hz - 5 kHz), magnitude smoothed over a 5-20 ms time window]
Limitations of spectral models
Not much data thrown away:
- just fine phase / time structure (smoothing)
- little actual 'modeling'
- still a large representation

Little separation of features:
- e.g. formants and pitch

Highly correlated features:
- modifications affect multiple parameters

But quite easy to reconstruct:
- iterative reconstruction of the lost phase (a sketch follows)
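The usual realization of that idea is Griffin & Lim-style alternating projection: impose the known magnitude, resynthesize, and keep only the phase of the re-analysis. A minimal sketch with scipy; the iteration count and frame size are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def iterative_phase(mag, n_iter=50, nperseg=256):
    """Alternate between imposing the known |STFT| and keeping the phase
    implied by a time-domain round trip (Griffin & Lim-style sketch)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random init
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)      # impose magnitude
        _, _, X = stft(x, nperseg=nperseg)              # re-analyze
        T = min(X.shape[1], mag.shape[1])               # match frame counts
        phase[:, :T] = np.exp(1j * np.angle(X[:, :T]))  # keep phase only
    return istft(mag * phase, nperseg=nperseg)[1]
```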
The cepstrum
Original motivation: assume a source-filter model:
[Figure: excitation source g[n] (pulse train in time) driving a resonance filter H(e^{j\omega})]
Define 'homomorphic deconvolution':
- source-filter convolution: g[n] * h[n]
- FT → product: G(e^{j\omega}) H(e^{j\omega})
- log → sum: \log G(e^{j\omega}) + \log H(e^{j\omega})
- IFT → separable fine/broad structure: c_g[n] + c_h[n] = deconvolution

Definition (real cepstrum): c_n = idft(\log |dft(x[n])|)
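A minimal numpy sketch of this definition, together with the low-quefrency liftering used on the following slides; the cutoff of 30 samples is an illustrative choice:

```python
import numpy as np

def real_cepstrum(x):
    """c_n = idft(log |dft(x[n])|)."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-10)).real

def liftered_envelope(x, cutoff=30):
    """Zero the high-quefrency cepstrum (keeping the symmetric low-n
    part) and return the smoothed log-magnitude spectral envelope."""
    c = real_cepstrum(x)
    c[cutoff:-cutoff] = 0.0        # assumes len(x) > 2 * cutoff
    return np.fft.fft(c).real      # smoothed log|X(e^jw)|
```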
Stages in cepstral deconvolution
The original waveform has excitation fine structure convolved with resonances.

The DFT shows harmonics modulated by resonances.

The log DFT is the sum of a harmonic 'comb' and resonant bumps.

The IDFT separates out the resonant bumps (low quefrency) from the regular fine structure (the 'pitch pulse').

Selecting the low-n cepstrum keeps only the resonance information (deconvolution / 'liftering').
[Figure, one panel per stage: waveform and minimum-phase impulse response; |DFT| and its liftered version; log|DFT| and its liftered version; real cepstrum with lifter window and the pitch pulse near quefrency 100]
Properties of the cepstrum
Separates source (fine structure) from filter (broad structure):
- smooth the log magnitude spectrum to get the resonances

Smoothing the spectrum is filtering along frequency, i.e. a convolution; applied in the Fourier domain it becomes multiplication in the IFT ('liftering')

Periodicity in time → harmonics in the spectrum → a 'pitch pulse' in the high-n cepstrum

Low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

c_n = \int \log\left|X(e^{j\omega})\right| \cos(n\omega)\, d\omega

(the j \sin(n\omega) term vanishes because \log|X(e^{j\omega})| is even in \omega)

[Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectral envelope]
Aside: correlation of elements
Cepstrum is popular in speech recognition:
- feature vector elements are decorrelated

[Figure: auditory spectrum vs. cepstral coefficient features over frames; feature covariance matrices (near-diagonal for cepstra); example joint distribution of coefficients (10, 15)]

- c_0 'normalizes out' the average log energy

Decorrelated pdfs fit diagonal Gaussians:
- simple correlation is a waste of parameters

DCT is close to PCA for (mel) spectra? (a sketch of the DCT step follows)
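A sketch of that decorrelating DCT on log mel-spectral frames; the array shapes and the 13-coefficient truncation are illustrative, and computing the mel filterbank itself is assumed done elsewhere:

```python
import numpy as np
from scipy.fft import dct

def cepstra_from_log_mel(log_mel, n_cep=13):
    """DCT-II along frequency decorrelates log-spectral features;
    keeping low-order coefficients approximates PCA truncation.
    `log_mel` has shape (n_frames, n_mel_bands)."""
    return dct(log_mel, type=2, norm='ortho', axis=-1)[:, :n_cep]
```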
Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
Linear predictive modeling (LPC)
LPC is a very successful speech model:
- mathematically efficient (IIR filters)
- remarkably accurate for voice (fits the source-filter distinction)
- satisfying physical interpretation (resonances)

Basic math: model the output as a linear function of prior outputs:

s[n] = \left( \sum_{k=1}^{p} a_k\, s[n-k] \right) + e[n]

. . . hence "linear prediction" (pth order); e[n] is the excitation (input), AKA the prediction error

⇒ \frac{S(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}

. . . all-pole modeling, 'autoregressive' (AR) model
Vocal tract motivation for LPC
Direct expression of source-filter model
s[n] = \left( \sum_{k=1}^{p} a_k\, s[n-k] \right) + e[n]

[Block diagram: pulse/noise excitation e[n] → vocal tract H(z) = 1/A(z) → s[n]; z-plane pole plot and |H(e^{j\omega})|]

Acoustic tube models suggest an all-pole model for the vocal tract

Relatively slowly changing:
- update A(z) every 10-20 ms

Not perfect: nasals introduce zeros
Estimating LPC parameters
Minimize short-time squared prediction error
E = \sum_{n=1}^{m} e^2[n] = \sum_n \left( s[n] - \sum_{k=1}^{p} a_k\, s[n-k] \right)^2

Differentiate w.r.t. a_k to get an equation for each k:

0 = \sum_n 2 \left( s[n] - \sum_{j=1}^{p} a_j\, s[n-j] \right) (-s[n-k])

\sum_n s[n]\, s[n-k] = \sum_j a_j \sum_n s[n-j]\, s[n-k]

\phi(0, k) = \sum_j a_j\, \phi(j, k)

where \phi(j, k) = \sum_{n=1}^{m} s[n-j]\, s[n-k] are correlation coefficients
- p linear equations to solve for all the a_j s . . .
Evaluating parameters
Linear equations \phi(0, k) = \sum_{j=1}^{p} a_j\, \phi(j, k)

If s[n] is assumed to be zero outside of some window:

\phi(j, k) = \sum_n s[n-j]\, s[n-k] = r_{ss}(|j - k|)

- r_{ss}(\tau) is the autocorrelation

Hence the equations become:

\begin{pmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{pmatrix} =
\begin{pmatrix}
r(0) & r(1) & \cdots & r(p-1) \\
r(1) & r(0) & \cdots & r(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
r(p-1) & r(p-2) & \cdots & r(0)
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix}

Toeplitz matrix (constant along each diagonal)
→ can use the Levinson-Durbin recursion to solve

(solving with the full \phi(j, k) requires e.g. Cholesky decomposition)
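A minimal numpy/scipy sketch of the autocorrelation method; the order p = 12 and the Hamming window are illustrative choices, and scipy's Toeplitz solver uses Levinson recursion internally:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, p=12):
    """Autocorrelation-method LPC: solve phi(0,k) = sum_j a_j phi(j,k),
    i.e. r(k) = sum_j a_j r(|j-k|), for the predictors a_1..a_p."""
    x = x * np.hamming(len(x))                   # taper the analysis frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:]  # r(0), r(1), ...
    return solve_toeplitz(r[:p], r[1:p + 1])     # Toeplitz solve (Levinson)
```

The prediction residual then follows from the inverse filter, e.g. `scipy.signal.lfilter(np.r_[1, -a], [1.0], x)`.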
LPC illustration
[Figure: windowed original waveform and LPC residual (time, 0-400 samples); original spectrum, LPC spectrum, and residual spectrum (dB, 0-7 kHz); actual poles in the z-plane]
Interpreting LPC
Picking out resonances:
- if the signal really was a source driving all-pole resonances, LPC should find those resonances

Least-squares fit to the spectrum:
- minimizing \sum e^2[n] in the time domain is the same as minimizing |E(e^{j\omega})|^2, by Parseval
→ close fit to spectral peaks; valleys don't matter

Removing smooth variation in the spectrum:
- 1/A(z) is a low-order approximation to S(z)
- S(z)/E(z) = 1/A(z), hence the residual E(z) = A(z)S(z) is a 'flat' version of S

Signal whitening:
- white noise (independent x[n]s) has a flat spectrum
→ whitening removes temporal correlation
Alternative LPC representations
Many alternative p-dimensional representations:
- coefficients \{a_j\}
- roots \{\lambda_j\}: \prod_j (1 - \lambda_j z^{-1}) = 1 - \sum_k a_k z^{-k} (a root-finding sketch follows this list)
- line spectrum frequencies . . .
- reflection coefficients \{k_j\} from the lattice form
- tube-model log area ratios g_j = \log\left(\frac{1 - k_j}{1 + k_j}\right)

Choice depends on:
- mathematical convenience / complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics
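For instance, the root representation follows directly from the coefficients; a sketch, where the bandwidth formula is the usual pole-radius approximation:

```python
import numpy as np

def lpc_resonances(a, fs):
    """Roots of A(z) = 1 - sum_k a_k z^-k; poles near the unit circle
    behave like formant resonances."""
    poles = np.roots(np.r_[1.0, -np.asarray(a)])
    poles = poles[poles.imag > 0]                 # one per conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)    # resonance frequencies, Hz
    bws = -np.log(np.abs(poles)) * fs / np.pi     # ~3 dB bandwidths, Hz
    order = np.argsort(freqs)
    return freqs[order], bws[order]
```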
LPC applications
Analysis-synthesis (coding, transmission):
- S(z) = E(z)/A(z), hence can reconstruct by filtering e[n] with the \{a_j\}s
- the whitened, decorrelated, minimized e[n]s are easy to quantize . . . or e[n] can be modeled e.g. as a simple pulse train

Recognition / classification:
- the LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

Modification:
- separating source and filter supports cross-synthesis
- the pole / resonance model supports 'warping', e.g. male → female
Aside: Formant tracking
Formants carry (most?) linguistic information
Why not classify them directly → speech recognition?
- e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit

But: recognition needs to work in all circumstances:
- formants can be obscured or undefined
[Figure: spectrograms (0-4 kHz) of the original (mpgr1_sx419) and of noise-excited LPC resynthesis with the pole frequencies overlaid]
→ need more graceful, robust parameters
Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
Sinusoid modeling
Early signal models required low complexity
e.g. LPC
Advances in hardware open new possibilities. . .
The narrowband (NB) spectrogram suggests a harmonics model

[Figure: narrowband spectrogram, 0-4 kHz over 1.5 s, showing smooth harmonic tracks]

- the 'important' info in the 2D surface is the set of tracks?
- harmonic tracks have ~smooth properties
- straightforward resynthesis
Sine wave models
Model sound as sum of AM/FM sinusoids
s[n] = \sum_{k=1}^{N[n]} A_k[n] \cos(n\, \omega_k[n] + \phi_k[n])

- A_k, \omega_k, \phi_k piecewise linear or constant
- can enforce harmonicity: \omega_k = k\, \omega_0

Extract parameters directly from STFT frames:

[Figure: magnitude peaks in successive STFT frames linked into tracks across time and frequency]

- find local maxima of |S[k, n]| along frequency
- track birth/death and correspondence
Finding sinusoid peaks
Look for local maxima along each DFT frame,
i.e. |S[k-1, n]| < |S[k, n]| > |S[k+1, n]|

Want the exact frequency of the implied sinusoid:
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bands = 15.6 Hz/band

[Figure: quadratic fit to 3 spectral samples around a local maximum gives an interpolated peak frequency and magnitude]

- may also need interpolated, unwrapped phase

Or, use the differential of phase along time (phase vocoder):

\omega = \frac{a\dot{b} - b\dot{a}}{a^2 + b^2} \quad \text{where } S[k, n] = a + jb
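A sketch of the quadratic-fit refinement on a dB-magnitude DFT frame; bin_hz is the bin spacing (e.g. the 15.6 Hz/band above):

```python
import numpy as np

def interp_peaks(mag_db, bin_hz):
    """Refine each local maximum of a dB spectrum by a quadratic fit to
    the three surrounding samples; returns (freq_hz, level_db) pairs."""
    peaks = []
    for k in range(1, len(mag_db) - 1):
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        if b > a and b > c:                       # local maximum at bin k
            d = 0.5 * (a - c) / (a - 2 * b + c)   # parabola vertex, in bins
            peaks.append(((k + d) * bin_hz, b - 0.25 * (a - c) * d))
    return peaks
```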
Sinewave modeling applications
Modification (interpolation) and synthesis:
- connecting arbitrary \omega and \phi requires cubic phase interpolation (because \omega = \dot{\phi})

Types of modification:
- time and frequency scale modification . . . with or without changing the formant envelope
- concatenation / smoothing boundaries
- phase realignment (for crest reduction)

Non-harmonic signals? OK-ish

[Figure: spectrogram of a less harmonic sound, 0-4 kHz]
Harmonics + noise model
Motivation to improve on the sinusoid model:
- problems with analysis of real (noisy) signals
- problems with synthesis quality (esp. noise)
- perceptual suspicions

Model:

s[n] = \underbrace{\sum_{k=1}^{N[n]} A_k[n] \cos(n k\, \omega_0[n])}_{\text{Harmonics}} + \underbrace{e[n]\,(h_n[n] * b[n])}_{\text{Noise}}

- sinusoids are forced to be harmonic
- the remainder is filtered and time-shaped noise

'Break frequency' F_m[n] between harmonics (H) and noise (N)

[Figure: spectrum (dB, 0-3 kHz) split at the harmonicity limit F_m[n]: harmonics below, noise above]
HNM analysis and synthesis

Dynamically adjust F_m[n] based on a 'harmonic test':

[Figure: spectrogram (0-4 kHz) with the estimated F_m[n] contour overlaid]

Noise has envelopes in time, e[n], and frequency, H_n:

[Figure: noise frequency envelope H_n[k] (dB, 0-3 kHz) and time envelope e[n] over ~30 ms]
reconstruct bursts / synchronize to pitch pulses
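A sketch of resynthesizing the harmonic part from frame-wise f0 and amplitude tracks; linear parameter interpolation and a running phase accumulator are common choices, and all names here are illustrative:

```python
import numpy as np

def synth_harmonics(f0_frames, amp_frames, fs, hop):
    """Sum of harmonics k*f0 with per-frame amplitudes; parameters are
    interpolated to the sample rate and phase is accumulated so frequency
    varies smoothly (assumes k*f0 stays below Nyquist)."""
    n_frames, n_harm = amp_frames.shape
    t = np.arange(n_frames * hop)
    centers = np.arange(n_frames) * hop
    f0 = np.interp(t, centers, f0_frames)
    out = np.zeros(len(t))
    for k in range(1, n_harm + 1):
        amp = np.interp(t, centers, amp_frames[:, k - 1])
        phase = 2 * np.pi * np.cumsum(k * f0) / fs   # integrated frequency
        out += amp * np.cos(phase)
    return out
```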
Outline
1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis
Speech synthesis
One thing you can do with models
Synthesis easier than recognition?
- listeners do the work
- . . . but listeners are very critical

Overview of synthesis:

[Block diagram: text → text normalization → phoneme generation → prosody generation → synthesis algorithm → speech]

- normalization disambiguates text (abbreviations)
- phonetic realization from a pronunciation dictionary
- prosodic synthesis by rule (timing, pitch contour)
. . . all of which control waveform generation
Source-filter synthesis
Flexibility of source-filter model is ideal for speech synthesis
[Block diagram: glottal pulse source (driven by pitch info and the voiced/unvoiced decision) and a noise source feed a vocal tract filter (driven by phoneme info) to produce speech, e.g. for the phone sequence "th ax k ae t"]
Excitation source issues:
- voiced / unvoiced / mixture ([th] etc.)
- pitch cycles of voiced segments
- glottal pulse shape → voice quality?
Vocal tract modeling

Simplest idea: store a single VT model for each phoneme

[Figure: piecewise-constant spectral templates over time for "th ax k ae t"]

but discontinuities are very unnatural

Improve by smoothing between templates

[Figure: the same templates with smoothed transitions]

the trick is finding the right domain
Cepstrum-based synthesis
Low-n cepstrum is compact model of target spectrum
Can invert to get actual VT IR waveforms:
c_n = idft(\log |dft(x[n])|) \Rightarrow h[n] = idft(\exp(dft(c_n)))

All-zero (FIR) VT response
→ can pre-convolve with glottal pulses

[Figure: glottal pulse inventory convolved with per-phone impulse responses (ee, ae, ah), placed at pitch pulse times from the pitch contour]

- cross-fading between templates is OK (a sketch of the inversion follows)
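A sketch of that inversion. To land on the minimum-phase impulse response (as in the liftering figure earlier) rather than a zero-phase one, the real cepstrum is first folded to be causal; that folding is an assumption about the intended construction:

```python
import numpy as np

def cepstrum_to_min_phase_ir(c):
    """Fold the real cepstrum to be causal (minimum-phase construction),
    then invert: h[n] = idft(exp(dft(c_folded))). Assumes len(c) even."""
    N = len(c)
    folded = np.zeros(N)
    folded[0] = c[0]
    folded[1:N // 2] = 2 * c[1:N // 2]   # double the positive quefrencies
    folded[N // 2] = c[N // 2]
    return np.fft.ifft(np.exp(np.fft.fft(folded))).real
```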
LPC-based synthesis
Very compact representation of target spectra:
- 3 or 4 pole pairs per template

Low-order IIR filter → very efficient synthesis

How to interpolate?
- cannot just interpolate a_j in a running filter
- but the lattice filter has better-behaved interpolation

[Figure: direct-form all-pole synthesis filter vs. the lattice filter built from reflection coefficients k_j]

What to use for excitation (a synthesis sketch follows):
- residual from the original analysis
- reconstructed periodic pulse train
- parametrized residual resynthesis
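A sketch of the pulse-train-excited case with scipy; the coefficients are held constant over the segment, whereas a real synthesizer would interpolate reflection coefficients as noted above:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synth(a, f0, fs, dur=0.5):
    """Excite the all-pole filter 1/A(z) with a periodic pulse train:
    s[n] = e[n] + sum_k a_k s[n-k]."""
    e = np.zeros(int(dur * fs))
    e[::int(round(fs / f0))] = 1.0            # glottal pulse train
    return lfilter([1.0], np.r_[1.0, -np.asarray(a)], e)
```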
Diphone synthesis

Problems in phone-concatenation synthesis:
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception
→ store transitions instead of just phonemes

[Figure: phone sequence for "watch thin as a dime" vs. the diphone segments spanning each phone-to-phone transition]

- ~40 phones ⇒ ~800 diphones
- or even more context given a larger database

How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized multiband
HNM synthesis
High-quality resynthesis of real diphone units plus a parametric representation for modification:
- pitch and timing modifications
- removal of discontinuities at boundaries

Synthesis procedure:
- linguistic processing gives phones, pitch, timing
- database search gives the best-matching units
- use HNM to fine-tune pitch and timing
- cross-fade A_k and \omega_0 parameters at boundaries

[Figure: cross-fading harmonic parameters across a unit boundary in time-frequency]

Careful preparation of the database is key:
- sine models allow phase alignment of all units
- a larger database improves unit match
Generating prosody
The real factor limiting speech synthesis?
Waveform synthesizers have inputs for:
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

Curves produced by superposition of (many) inferred linguistic rules:
- phrase-final lengthening, unstressed shortening, . . .
Or learn rules from transcribed elements
Summary
Range of models:
- spectral, cepstral
- LPC, sinusoid, HNM

Range of applications:
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch, time modification (HNM) → synthesis

Issues:
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality
Parting thought
not all parameters are created equal. . .