VQ Source Models - Ellis & Weiss 2006-05-16 - /12 1 1. Source Models for Separation 2. VQ with Perceptual Weighting 3. Phase and Resynthesis 4. Conclusions VQ Source Models: Perceptual & Phase Issues Dan Ellis & Ron Weiss Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA {dpwe,ronw}@ee.columbia.edu http://labrosa.ee.columbia.edu /
VQ Source Models - Ellis & Weiss 2006-05-16 - /121
1. Source Models for Separation 2. VQ with Perceptual Weighting 3.
Phase and Resynthesis 4. Conclusions
VQ Source Models: Perceptual & Phase Issues
Dan Ellis & Ron Weiss Laboratory for Recognition and
Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
{dpwe,ronw}@ee.columbia.edu http://labrosa.ee.columbia.edu/
Single-Channel Scene Analysis
underconstrained: infinitely many decompositions time-frequency
overlaps cause obliteration .. no obvious segmentation of sources
(?)
time / s time / s
2
4
6
8
0 0.5 1 1.5 2 0
level / dB
combination physics source models
i.e. no projection can guarantee to remove overlaps
• Overlaps ⇒ Ambiguity scene analysis = find “most reasonable”
explanation
• Ambiguity can be expressed probabilistically i.e. posteriors of
sources {Si} given observations X: P({Si}| X) ∝ P(X |{Si})
P({Si})
• Better source models → better inference .. learn from
examples?
3
Vector-Quantized (VQ) Source Models
• “Constraint” of source can be captured explicitly in a codebook
(dictionary):
defines the ‘subspace’ occupied by source
• Codebook minimizes distortion (MSE) by k-means clustering
4
1000 200 300 400 500 600 700 800 codebook index
20
40
60
80
level / dB
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Simple Source Separation • Given models for sources,
find “best” (most likely) states for spectra:
can include sequential constraints... different domains for
combining c and defining
• E.g. stationary noise:
5
{i1(t), i2(t)} = argmaxi1,i2p(x(t)|i1, i2) p(x|i1, i2) = N (x;ci1+
ci2,Σ) combination
model
0 1 2
0 1 2
Codebook Size • Two (main) variables:
o number of codewords o amount of training data
• Measure average accuracy (distortion):
5.5
6
6.5
7
7.5
8
8.5
9
9.5
10
)
600 training frames 1000 training frames 2000 training frames 4000
training frames 8000 training frames 16000 training frames o main
effect of
codebook size o larger codebooks need/allow more data o (large
absolute distortion values)
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Distortion Metric • Standard MSE gives equal weight by
channel
excessive emphasis on high frequencies
• Try e.g. Mel spectrum approx. log spacing of frequency bins
• Little effect (?):
d
0 0.5 1 1.5 2time / s0 0.5 1 1.5 2 0
20
40
60
80
Linear Frequency Mel Frequency
VQ800 Codebook - Linear distortion measure VQ800 Codebook -
Mel/cube root distance
codebook index codebook index
Resynthesis Phase • Codewords quantize spectrum magnitude
phase has arbitrary offset due to STFT grid
• Resynthesis (ISTFT) requires phase info use mixture phase? no
good for filling-in
• Spectral peaks indicate common instantaneous frequency ( ) can
quantize and cumulate in resynthesis .. like the “phase
vocoder”
8
-40
-20
0
20
0 200 400 600 800 1000 0
200
400
600
800
1000
Resynthesis Phase (2) • Can also improve phase iteratively
repeat:
goal:
9
X (1)(t, f )=|X(t, f )| · exp{ jφ(1)(t, f )} x(1)(t)=istft{X (1)(t,
f )}
φ(2)(t, f )=∠ ! stft{x(1)(t)}
0 1 2 3 4 5 6 7 83
4
5
6
7
8
Starting from random phase
Evaluating Model Quality • Low distortion is not really the
goal;
models are to constrain source separation fit source spectra but
reject non-source signals
• Include sequential constraints e.g. transition matrix for
codewords .. or smaller HMM with distributions over codebook
• Best way to evaluate is via a task e.g. separating speech from
noise
10 0 0.5 1 1.5 2 2.5
80 60 40 20 0
80 60 40 20 0
80 60 40 20 0
level / dB
time / s
fre q
/ b in
Direct VQ (magsnr = -0.6 dB)
HMM-smoothed VQ (magsnr = -2.1 dB)
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Future Directions • Factorized codebooks
codebooks too large due to combinatorics separate codebooks for
type, formants, excitation?
• Model adaptation many speaker-dependents model, or... single
speaker-adapted model, fit to each speaker
• Using uncertainty enhancing noisy speech for listeners: use
special tokens to preserve uncertainty
11
0 1000 2000 3000 4000 5000 6000 7000 8000 -40
-20
0
20
40
Summary • Source models permit separation of
underconstrained mixtures or at least inference of source
state
• Explicit codebooks need to be large .. and chosen to optimize
perceptual quality
• Resynthesis phase can be quantized .. using “phase vocoder”
derivative .. iterative re-estimation helps more
12
Extra Slides
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Other Uses for Source Models Projecting into the model’s
space:
• Restoration / Extension inferring missing parts
• Generation / Copying: adaptation + fitting
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Example 2: Mixed Speech Recog. • Cooke & Lee’s Speech
Separation Challenge
short, grammatically-constrained utterances:
<command:4><color:4><preposition:4><letter:25><number:10><adverb:4>
e.g. "bin white at M 5 soon" t5_bwam5s_m5_bbilzp_6p1.wav
System Noise Condition
HTK 1.0 45.7 82.0 88.6 87.2
GDL-MAP I 2.0 33.2 68.6 85.4 87.3
GDL-MAP II 2.7 7.6 14.8 49.6 77.2
oracle 1.1 4.2 8.4 39.1 76.4
SDL 1.4 3.4 7.7 38.4 77.3
Table 2: Word error rates (percent) on the noisy development
set.
The error rate for the “random-guess” system is 87%. The sys-
tems in the table are: 1) The default HTK recognizer, 2) IBM–
GDL MAP–adapted to the speech separation training data, 3)
MAP–adapted to the speech separation training data and
artifi-
cially generated training data with added noise, 4) Oracle
MAP
adapted Speaker dependent system with known speaker IDs, 5)
MAP adapted speaker dependent models with SDL
6. Experiments and Results
The Speech Separation Challenge [1] involves separating the
mixed speech of two speakers drawn from of a set of 34
speakers.
An example utterance is place white by R 4 now. In each
record-
ing, one of the speakers says white while the other says blue,
red
or green. The task is to recognize the letter and the digit of
the
speaker that said white.
We decoded the two component signals under the assumption
that one signal contains white and the other does not, and
vice
versa. We then used the association that yielded the highest
com-
bined likelihood.
Log-power spectrum features were computed at a 15 ms rate.
Each frame was of length 40 ms and a 640 point FFT was used
producing a 3195 dimensional log-power-spectrum feature
vector.
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
100 a) Same Talker
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
100 b) Same Gender
6 dB 3 dB 0 dB !3 dB !6 dB !9 dB 0
50
SDL Recognizer No dynamics Acoustic dyn. Grammar dyn. Human
Figure 1: Word error rates for the a) Same Talker, b) Same
Gender
and c) Different Gender cases.
5the DC component was discarded
6 dB 3 dB 0 dB -3 dB -6 dB -9dB All
ST 29 42 47 47 46 55 44.3
SG 8 10 13 13 15 30 14.8
DG 9 8 11 18 22 36 17.3
All 16.0 21.2 25.0 26.8 28.8 41.2 26.5
Table 3: Word error rates (percent) for grammar and acoustic
con-
straints. ST-Same Talker, SG-Same Gender, DG-Different Gen-
der. Conditions where our system outperformed human listeners
are bolded.
Figure 1 shows results for the 3 different conditions. Human
listener performance [1] is shown along with the performance
of
the SDL recognizer without separation, GMM without dynam-
ics, using acoustic level dynamics, and using both grammar
and
acoustic-level dynamics.
The top plot in Figure 1 shows word error rates (WER) for the
Same Talker condition. In this condition, two recordings from
the
same speaker are mixed together. This conditions best
illustrates
the importance of temporal constrains. By adding the acoustic
dynamics, performance is improved considerably. By combin-
ing grammar and acoustic dynamics, performance improves
again,
surpassing human performance in the !3 dB condition. The second
plot in Figure 1 shows WER for the Same Gender
condition. In this condition, recordings from two different
speak-
ers of the same gender are mixed together. In this condition
our
system surpasses human performance in all conditions except 6 dB
and !9 dB.
The third plot in Figure 1 showsWER for the Different Gender
condition. In this condition, our system surpasses human
perfor-
mance in the 0 dB and 3 dB conditions. Interestingly, temporal
constraints do not improve performance relative to GMM
without
dynamics as dramatically as in the same talker case, which
indi-
cates that the characteristics of the two speakers in a short
segment
are effective for separation. The performance of our best system,
which uses both gram-
mar and acoustic-level dynamics, is summarized in Table 3. This
system surpassed human lister performance at SNRs of 0 dB and !3 dB
on average across all speaker conditions. Averaging across all
SNRs, the system surpassed human performance in the Same Gender
condition. Based on these initial results, we envision that
super-human performance over all conditions is within reach.
7. References
[1] Martin Cooke and Tee-Won Lee, “Interspeech speech separation
challenge,” http : //www.dcs.shef.ac.uk/ !
martin/SpeechSeparationChallenge.htm, 2006.
[2] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone
source separation using high resolution signal reconstruction,”
ICASSP, 2004.
[3] S. Roweis, “Factorial models and refiltering for speech
separation and denoising,” Eurospeech, pp. 1009–1012, 2003.
[4] P. Varga and R.K. Moore, “Hidden markov model decomposition of
speech and noise,” ICASSP, pp. 845–848, 1990.
[5] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation
for multivariate Gaussian mixture observations of Markov chains,”
IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2,
pp. 291– 298, 1994.
[6] Peder Olsen and Satya Dharanipragada, “An efficient integrated
gen- der detection scheme and time mediated averaging of gender
depen- dent acoustic models,,” in Proceedings of Eurospeech 2003,
Geneva, Switzerland, September 1-4 2003, vol. 4, pp.
2509–2512.
o Model individual speakers (512 mix GMM) o Infer speakers and gain
o Reconstruct speech o Recognize as normal... • Grammar constraints
a big help
Kristjansson et al. Interspeech’06
w er
Model-Based Separation • Central idea:
Employ strong learned constraints to disambiguate possible sources
{Si} = argmaxSi P(X | {Si})
• e.g. fit speech-trained Vector-Quantizer to mixed spectrum:
separate via T-F mask (again)
16
MAXVQ Results: Denoising
time
time
time
time
time
fr eq u en cy
Training: 300sec of isolated speech (TIMIT) to fit 512 codewords,
and 100sec of isolated noise (NOISEX) to fit 32 codewords; testing
on new signals at 0dB SNR
MAXVQ Results: Denoising
time
time
time
time
time
fr eq u en cy
Training: 300sec of isolated speech (TIMIT) to fit 512 codewords,
and 100sec of isolated noise (NOISEX) to fit 32 codewords; testing
on new signals at 0dB SNR
fro m
R ow
VQ Source Models - Ellis & Weiss 2006-05-16 - /12
Separation or Description? • Are isolated waveforms required?
clearly sufficient, but may not be necessary not part of perceptual
source separation!
• Integrate separation with application? e.g. speech
recognition
words output = abstract description of signal
17
vs.
Evaluation • How to measure separation performance?
depends what you are trying to do
• SNR? energy (and distortions) are not created equal different
nonlinear components [Vincent et al. ’06]
• Intelligibility? rare for nonlinear processing to improve
intelligibility listening tests expensive
• ASR performance? separate-then-recognize too simplistic; ASR
needs to accommodate separation
18