Pitch-synchronous overlap add (TD-PSOLA)

Purpose: Modify pitch or timing of a signal

• PSOLA is a time domain algorithm
• Pseudo code
1. Find the pitch points of the signal
2. Apply a Hanning window centered on each pitch point and extending to the previous and next pitch points
3. Add the windowed waves back together
• To slow down speech, duplicate frames
• To speed up, remove frames
• Hanning windowing preserves signal energy
• Undetectable if epochs are accurately found. Why? We are not altering the vocal filter, only changing the signal spacing
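The three pseudo code steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `psola_time_stretch`, the `rate` parameter, and the choice to reuse the nearest analysis epoch for each output position are assumptions, and the pitch marks are assumed to be supplied by a separate epoch detector.

```python
import numpy as np

def psola_time_stretch(x, epochs, rate):
    """Change duration by repeating or skipping pitch-synchronous frames.

    x      : 1-D array of samples (must extend at least to the last epoch)
    epochs : increasing sample indices of the pitch marks (assumed given)
    rate   : > 1 slows speech down (frames repeat), < 1 speeds it up
    """
    x = np.asarray(x, dtype=float)
    epochs = np.asarray(epochs, dtype=int)
    pad = int(np.diff(epochs).max())          # room for window tails
    n_out = int(round(epochs[-1] * rate))
    out = np.zeros(n_out + 2 * pad + 2)
    t_out = epochs[1] * rate                  # first synthesis epoch
    while t_out < epochs[-2] * rate:
        # Interior analysis epoch closest to the matching input time
        k = 1 + int(np.argmin(np.abs(epochs[1:-1] - t_out / rate)))
        left = epochs[k] - epochs[k - 1]
        right = epochs[k + 1] - epochs[k]
        # Hanning window rising over the previous period, falling over the next
        window = np.concatenate([np.hanning(2 * left + 1)[:left],
                                 np.hanning(2 * right + 1)[right:]])
        frame = x[epochs[k - 1]:epochs[k + 1] + 1] * window
        start = pad + int(round(t_out)) - left
        out[start:start + len(frame)] += frame   # overlap-add
        t_out += right                            # keep the local pitch period
    return out[pad:pad + n_out]
```

Because consecutive windows overlap by one period and the two Hanning halves sum to one, a perfectly periodic input is reproduced with the same period and only its duration changes.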

TD-PSOLA Illustrations

Figure panels: Pitch (window and add); Duration (insert or remove)

TD-PSOLA Pitch Points (Epochs)

• TD-PSOLA requires an exact marking of pitch points in a time domain signal
• Pitch mark
– Marking any part within a pitch period is okay as long as the algorithm marks the same point for every frame
– The most common marking point is the instant of glottal closure, which shows up as a quick time domain descent
• Create an array of sample numbers that comprise an analysis epoch sequence $P = \{p_1, p_2, \ldots, p_n\}$
• Estimate the local pitch period at epoch $p_k$ as $(p_{k+1} - p_{k-1})/2$, the average distance to its two neighboring epochs
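As a small worked example of the epoch sequence, the sketch below estimates a local pitch period at each interior epoch by averaging the distances to its two neighbors, $(p_{k+1} - p_{k-1})/2$. That reading of the period estimate is an assumption, as are the helper names `local_pitch_periods` and `periods_to_f0`.

```python
import numpy as np

def local_pitch_periods(epochs):
    """Period (in samples) at each interior epoch: (p[k+1] - p[k-1]) / 2."""
    p = np.asarray(epochs, dtype=float)
    return (p[2:] - p[:-2]) / 2.0

def periods_to_f0(periods, fs):
    """Convert periods in samples to fundamental frequency in Hz."""
    return fs / np.asarray(periods, dtype=float)
```

For epochs at samples 0, 100, 200 and 310, the two interior epochs get periods of 100 and 105 samples.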

TD-PSOLA Evaluation

• Advantages
– As a time domain algorithm, it is unlikely that any other approach will be more efficient (O(N))
– Listeners cannot perceive signal alterations of up to 50%
• Disadvantages
– Epoch marking must be exact
– Only timing changes are possible

Time Domain Pitch Detection

• Auto Correlation
– Correlate a window of speech with a previous window
– Find the best match
– Issue: too many false peaks
• Peak and center clipping
– Algorithm to reduce false peaks
– Clip the top/bottom of a signal
– Center the remainder around 0
• Other alternatives
– Researchers propose many other pitch detection algorithms
– There is much debate as to which is the best
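Center clipping can be sketched as below. This is one common formulation, offered as an assumption about what the slide intends: samples whose magnitude falls under a threshold are zeroed, and the remaining samples are shifted toward zero by the threshold. The name `center_clip` and the default `ratio` of 0.3 are illustrative choices.

```python
import numpy as np

def center_clip(frame, ratio=0.3):
    """Zero the low-amplitude center of a frame and re-center the rest."""
    frame = np.asarray(frame, dtype=float)
    threshold = ratio * np.max(np.abs(frame))
    clipped = np.zeros_like(frame)
    above = frame > threshold
    below = frame < -threshold
    clipped[above] = frame[above] - threshold   # shift positive peaks down
    clipped[below] = frame[below] + threshold   # shift negative peaks up
    return clipped
```

Running the clipped frame through autocorrelation leaves mainly the large pitch pulses to correlate, which tends to suppress peaks caused by formant structure.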

Auto Correlation

1. Auto Correlation
$R(k) = \frac{1}{M} \sum_{n=0}^{M-1} x_n x_{n-k}$, where $x_{n-k} = 0$ if $n-k < 0$
Find the $k$ (among the candidate pitch lags) that maximizes the sum
2. Difference Function
$D(k) = \frac{1}{M} \sum_{n=0}^{M-1} |x_n - x_{n-k}|$, where $x_{n-k} = 0$ if $n-k < 0$
Find the $k$ that minimizes the sum
3. Considerations
a. The difference approach is faster
b. Both can produce false positives
c. The YIN algorithm combines both techniques
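Both lag searches can be sketched together in Python. The function name `pitch_by_lag` and the search range (`fmin`, `fmax`) are assumptions added for illustration; restricting the lags to a plausible pitch range keeps the trivial lag $k = 0$ out of the search.

```python
import numpy as np

def pitch_by_lag(frame, fs, fmin=60.0, fmax=400.0):
    """Return (autocorrelation pitch, difference-function pitch) in Hz."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    M = len(x)
    k_min = int(fs / fmax)
    k_max = min(int(fs / fmin), M - 1)
    lags = np.arange(k_min, k_max + 1)
    r = np.empty(len(lags))
    d = np.empty(len(lags))
    for i, k in enumerate(lags):
        # x delayed by k samples, zero where n - k < 0
        shifted = np.concatenate([np.zeros(k), x[:M - k]])
        r[i] = np.dot(x, shifted) / M              # autocorrelation
        d[i] = np.sum(np.abs(x - shifted)) / M     # difference function
    return fs / lags[np.argmax(r)], fs / lags[np.argmin(d)]
```

For a clean 200 Hz sine sampled at 8000 Hz, both estimates land on a lag of 40 samples. On real speech the two can disagree, which is the false-positive issue noted above.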

Harmonic Product Spectrum

Pseudo Code
Divide the signal into frames (20-30 ms long)
Perform an FFT
Downsample the FFT by factors of 2, 3, 4 (taking every 2nd, 3rd, 4th value)
Multiply the FFT magnitude and the downsampled spectra together (equivalently, add their log magnitudes)
The pitch harmonics will line up (the spectrum will "spike" at the pitch value)
Find the spike: return fsample / fftSize * index
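A minimal Python sketch of the harmonic product spectrum for a single frame follows. It multiplies the magnitude spectra, which is equivalent to adding log magnitudes. The function name `hps_pitch`, the Hanning pre-window, and the decision to skip the DC bin are assumptions.

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=4):
    """Estimate pitch in Hz from one frame via the harmonic product spectrum."""
    x = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(x))
    n = len(spectrum) // n_harmonics          # bins usable at every factor
    product = spectrum[:n].copy()
    for factor in range(2, n_harmonics + 1):
        # Taking every factor-th bin lines harmonic `factor` up with the pitch
        product *= spectrum[::factor][:n]
    index = 1 + int(np.argmax(product[1:]))   # skip the DC bin
    return fs * index / len(frame)            # fsample / fftSize * index
```

The frequency resolution is `fs / len(frame)`, so a 800-sample frame at 8000 Hz resolves pitch to the nearest 10 Hz.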

Frequency Spectrum

Background Noise

• Definition: an unwanted sound or an unwanted perturbation to a wanted signal
• Examples:
– Clicks from microphone synchronization
– Ambient noise level: background noise
– Roadway noise
– Machinery
– Additional speakers
– Background activities: TV, radio, dog barks, etc.
• Classifications
– Stationary: doesn’t change with time (e.g. fan)
– Non-stationary: changes with time (e.g. door closing, TV)

Noise Spectrums

Power measured relative to frequency f

• White: constant over the range of f
• Pink: decreases by 3 dB per octave; perceived as equal across f
• Brown(ian): decreases in proportion to $1/f^2$ (6 dB per octave)
• Red: decreases with f (either pink or brown)
• Blue: increases in proportion to f
• Violet: increases in proportion to $f^2$
• Gray: proportional to a psycho-acoustical curve
• Orange: bands of 0 around musical notes
• Green: noise of the world; pink, with a bump near 500 Hz
• Black: 0 everywhere except for spikes of $1/f^\beta$, where $\beta > 2$
• Colored: any noise that is not white

Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html


Applications

• ASR: Prevent significant degradation in noisy environments
– Goal: Minimize recognition degradation with noise present
• Sound Editing and Archival:
– Improve intelligibility of audio recordings
– Goals: Eliminate perceptible noise; recover audio from wax recordings
• Mobile Telephony:
– Transmission of audio in high noise environments
– Goal: Reduce transmission requirements
• Comparing audio signals
– A variety of digital signal processing applications
– Goal: Normalize audio signals for ease of comparison

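Signal to Noise Ratio (SNR)

• Definition: Power ratio between a signal and the noise that interferes with it
• Standard equation in decibels:
$\mathrm{SNR}_{dB} = 10 \log_{10}\left((A_{signal}/A_{noise})^2\right) = 20 \log_{10}(A_{signal}/A_{noise})$
• For digitized speech:
$\mathrm{SNR}_f = 10 \log_{10}\left(P_{signal}/P_{noise}\right) = 10 \log_{10}\left(\sum_{n=0}^{N-1} s_f(n)^2 \Big/ \sum_{n=0}^{N-1} n_f(n)^2\right)$
– $s_f$ is an array holding the samples from a frame
– $n_f$ is an array of noise samples
• Note: if $s_f(n) = n_f(n)$ for every $n$, then $\mathrm{SNR}_f = 0$ dB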
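The frame-level SNR can be computed directly from the two sample arrays. In this sketch the function name `snr_db` is an assumption, and both arrays are taken to have the same length.

```python
import numpy as np

def snr_db(signal_frame, noise_frame):
    """SNR in dB: 10 * log10(sum of signal squares / sum of noise squares)."""
    s = np.asarray(signal_frame, dtype=float)
    n = np.asarray(noise_frame, dtype=float)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(n ** 2))
```

Identical signal and noise arrays give 0 dB, and scaling the signal amplitude by 10 raises the result by 20 dB, which matches the amplitude form of the equation.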

Stationary Noise Suppression

• Requirements
– Maximize the amount of noise removed
– Minimize signal distortion
– Efficient algorithm with low big-Oh complexity
• Problems
– Tradeoff between removing noise and distorting the signal
– More noise removal tends to distort the signal
• Popular approaches
– Time domain: Moving average filter (distorts frequency domain)
– Frequency domain: Spectral Subtraction
– Time domain: Wiener filter (using LPC)

Auto regression Noise Removal

• Definition: An autoregressive process is one where a value can be determined by a linear combination of previous values
• Formula: $X_t = c + \sum_{i=1}^{P} a_i X_{t-i} + n_t$, where $c$ is a constant, $n_t$ is the noise, and the summation is the pure signal
• This is none other than linear prediction; the noise is the residue
• Applying the LPC filter to the signal separates noise from signal (Wiener Filter)
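The linear prediction view can be sketched as follows: estimate the coefficients $a_i$ from the signal's autocorrelation, predict each sample from the previous ones, and treat what is left over as the residue. The sketch assumes $c = 0$ and uses the autocorrelation (normal equations) method; the names `lpc_coefficients` and `lpc_residual` are illustrative.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Solve the normal equations R a = r for the predictor coefficients."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lpc_residual(x, a):
    """Residue: x[t] minus the prediction sum_i a[i] * x[t - i]."""
    x = np.asarray(x, dtype=float)
    predicted = np.zeros_like(x)
    for i, a_i in enumerate(a, start=1):
        predicted[i:] += a_i * x[:-i]
    return x - predicted
```

For a signal generated by $x_t = 1.5 x_{t-1} - 0.7 x_{t-2}$ from a single impulse, the recovered coefficients match the generator and the residue is the impulse alone.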

Spectral Subtraction

Assumption: Noisy signal: $y_t = s_t + n_t$, where $s_t$ is the clean signal and $n_t$ is additive noise

Perform FFT on all windowed frames
IF speech not present
    Update the estimate of the noise spectrum { $\sigma n_t + (1-\sigma) n_{t-1}$, $0 \le \sigma \le 1$ }
ELSE
    Subtract the estimated noise spectrum
Perform an inverse FFT

S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, Signal Processing, vol. ASSP-27, Apr. 1979.
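The spectral subtraction loop can be sketched in Python under a few stated assumptions: the frames arrive already windowed, a per-frame speech flag is supplied by some voice activity detector, negative magnitudes are floored at zero, and the noisy phase is reused for the inverse FFT. The function name and the default `sigma` are illustrative, not taken from Boll's paper.

```python
import numpy as np

def spectral_subtraction(frames, is_speech, sigma=0.5):
    """frames: 2-D array (n_frames, frame_len); is_speech: one flag per frame."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[1]
    noise_mag = None                       # running noise magnitude estimate
    out = np.empty_like(frames)
    for t, frame in enumerate(frames):
        spectrum = np.fft.rfft(frame)
        mag = np.abs(spectrum)
        if not is_speech[t]:
            # sigma * n_t + (1 - sigma) * n_{t-1}
            noise_mag = mag if noise_mag is None else sigma * mag + (1 - sigma) * noise_mag
        elif noise_mag is not None:
            mag = np.maximum(mag - noise_mag, 0.0)   # floor negatives at zero
        # Rebuild the frame with the (noisy) phase
        out[t] = np.fft.irfft(mag * np.exp(1j * np.angle(spectrum)), n=n)
    return out
```

Frames flagged as non-speech pass through unchanged here; a fuller system would also attenuate them and overlap-add the output frames.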

Implementation Issues

1. Question: How do we estimate the noise?
Answer: Use the frequency distribution during times when no voice is present
2. Question: How do we know when voice is present?
Answer: Use Voice Activity Detection algorithms (VAD)
3. Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals?
Answer: Human hearing largely ignores phase differences
4. Question: Is the noise independent of the signal?
Answer: We assume that the noise is linear and does not interact with the signal.
5. Question: Are noise distributions really stationary?
Answer: We assume yes.

Phase Distortions

• Problem: We don’t know how much of the phase in an FFT is from noise and how much is from speech.
• Assumption: The algorithm assumes the phase of both is the same (that of the noisy signal).
• Result: When the SNR approaches 0 dB, the audio has a hoarse sounding voice.
• Why? The phase assumption means that the expected noise magnitude is incorrectly calculated.
• Conclusion: There is a limit to the utility of spectral subtraction when the SNR is close to zero

Evaluation

• Advantage: Easy to understand and implement
• Disadvantages
– The noise estimate is not exact
  • When too high, speech portions will be lost
  • When too low, some noise remains
  • When the noise estimate in a frequency bin exceeds the noisy signal's magnitude in that bin, a negative magnitude results, which causes musical tone artifacts
– Non-linear or interacting noise
  • Negligible with large SNR values
  • Significant impact when SNR is small

Musical noise

Definition: Random isolated tone bursts across the frequency range.
Why? Most implementations set frequency bin magnitudes to zero if noise reduction would cause them to become negative
Figure legend: green dashes = noisy signal; solid line = noise estimate; black dots = projected clean signal

Spectral Subtraction Enhancements

• Eliminate negative frequencies
• Reduce the noise estimates by some factor
o Vary the noise estimate factor in different frequency bands
o Larger in regions outside of the human speech range
• Apply psycho-acoustical methods
o Only attempt to remove perceived noise, not all noise
o Human hearing masks sounds of adjacent frequencies
o A loud sound masks sounds even after it ceases

• Adaptive noise estimation: Nt(f) = λFGt(p-1)+(1-λF)Nt-1(f)

Threshold of Hearing

Masking

Acoustical Effects

• Characteristic Frequency (CF): The frequency that causes maximum response at a point of the Cochlea Basilar Membrane

• Neurons exhibit a maximum response for 20 ms and then decrease to a steady state, shortly after the stimulus is removed
• Masking effects can be simultaneous or temporal
– Simultaneous: one signal drowns out another
– Temporal: one signal masks the ones that follow
– Forward: masking persists after the masker is removed (5 ms–150 ms)
– Back: a weak signal is masked by a strong one that follows it (5 ms)

Voice Activity Detector (VAD)

• Many VAD algorithms exist
• Possible approaches to consider
– Energy above background noise
– Low zero crossing rate
– Determine if pitch is present
– Low fractal dimensions compared to pure noise
– Low LPC residual

• General principle: It is better to misclassify noise as speech than to misclassify speech as noise

• Standard algorithms: telephone/cell phone environments

Possible VAD algorithm

boolean vad(double[] frame) // returns true if speech is present

IF frame energy < low noise threshold (standard deviation units) RETURN FALSE
IF frame energy > high noise threshold RETURN TRUE
FOR forward frames
    IF forward frame energy < low noise threshold RETURN FALSE
    IF forward frame energy > high noise threshold
        FOR previous ¼ second of frames
            COUNT previous frames having a large 0-crossing rate
        IF count > 0-crossing threshold (standard deviation units)
            IF this frame index > index of the first frame with 0-crossing rate > threshold
                RETURN TRUE
RETURN FALSE

Note: the energy and 0-crossings of noise are estimated from the initial ¼ second
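A simplified Python sketch of the two-threshold idea follows. It keeps only the energy tests and the look-ahead step and omits the zero-crossing check, so it is an illustration of the principle rather than the algorithm above. The function name `simple_vad`, the multipliers `k_low` and `k_high`, and the use of the first `n_noise_frames` frames as the noise sample are assumptions.

```python
import numpy as np

def simple_vad(frames, n_noise_frames, k_low=2.0, k_high=5.0):
    """Return one boolean per frame: True where speech is judged present."""
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)
    noise = energy[:n_noise_frames]            # assumed to contain no speech
    low = noise.mean() + k_low * noise.std()   # thresholds in std-dev units
    high = noise.mean() + k_high * noise.std()
    decisions = np.zeros(len(frames), dtype=bool)
    for t, e in enumerate(energy):
        if e > high:
            decisions[t] = True
        elif e >= low:
            # Ambiguous frame: accept it only if a clearly loud frame follows
            # before the energy falls back under the low threshold.
            for future in energy[t + 1:]:
                if future < low:
                    break
                if future > high:
                    decisions[t] = True
                    break
    return decisions
```

Lowering `k_low` and `k_high` biases the detector toward labeling frames as speech, in line with the principle that misclassifying noise as speech is the safer error.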
