8/3/2019 Haifeng Ms Thesis
1/78
BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR
TELEPHONE-ASSISTIVE APPLICATIONS
APPROVED BY SUPERVISORY COMMITTEE:
Dr. Philipos Loizou, Chair
Dr. Andrea Fumagalli
Dr. Murat Torlak
Copyright 2002
Haifeng Qian
All Rights Reserved
BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR
TELEPHONE-ASSISTIVE APPLICATIONS
by
HAIFENG QIAN
THESIS
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
THE UNIVERSITY OF TEXAS AT DALLAS
May 2002
ACKNOWLEDGEMENTS
I would like to thank my adviser, Dr. Philipos Loizou, for his guidance in my research. He
has offered me many helpful suggestions throughout my two-year graduate study.
I would also like to thank Dr. Andrea Fumagalli and Dr. Murat Torlak for their valuable
feedback on this manuscript.
Thanks also go to my coworkers in the Speech Processing Lab, for their cooperation and
friendship. It has been my pleasure to work with them. Dr. Oguz Poroy, a former member of
the lab, helped me with the hardware building in this thesis.
I would also like to take this opportunity to thank the National Institutes of Health for
supporting this research under Grant R01 DC03421.
8/3/2019 Haifeng Ms Thesis
5/78
v
BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR
TELEPHONE-ASSISTIVE APPLICATIONS
Haifeng Qian, M.S.E.E.
The University of Texas at Dallas, 2002
Supervising Professor: Dr. Philipos C. Loizou
This thesis addresses the problem of helping hearing-impaired people to use telephones.
There are two aspects of this work: a Bluetooth-based wireless phone adapter and a
bandwidth-extension algorithm. Built upon the Bluetooth technology, the proposed phone
adapter routes the telephone audio signal to the hearing aid or the CI processor wirelessly,
thereby avoiding environmental noise and interference. The proposed bandwidth-extension
algorithm has the potential to increase speech intelligibility for hearing-impaired people
by estimating a wide-band signal from the narrow-band telephone signal. This is done by
piecewise linear estimation based on line spectral frequencies, combined with a statistical
speech-frame classification technique based on hidden Markov models that overcomes a
drawback of conventional bandwidth-extension algorithms. The phone adapter was tested by
CI users, and the proposed algorithm was evaluated by objective measures. Both results
showed good performance.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS..................................................................................................... iv
ABSTRACT.............................................................................................................................. v
LIST OF FIGURES................................................................................................................viii
LIST OF TABLES.................................................................................................................... x
1. INTRODUCTION................................................................................................................ 1
2. LITERATURE REVIEW..................................................................................................... 3
2.1 Assistive listening devices ............................................................................................ 3
2.1.1 Hardwired devices................................................................................................ 5
2.1.2 Induction loop devices ......................................................................................... 6
2.1.3 Frequency modulation devices............................................................................. 6
2.1.4 Infrared light devices........................................................... 7
2.2 Telephone recognition by CI users ............................................................................... 8
2.3 Speech enhancement by bandwidth extension............................................................ 11
2.3.1 Fundamentals of bandwidth extension............................................................... 12
2.3.2 Residual extension ............................................................................................. 13
2.3.3 Codebook method .............................................................................................. 15
2.3.4 Linear estimation method................................................................................... 17
3. BLUETOOTH-BASED PHONE ADAPTER.................................................................... 20
3.1 Introduction to Bluetooth ............................................................................................ 20
3.2 Phone adapter design................................................... 23
3.3 Hardware design.......................................................... 24
3.4 Software design for wireless link................................................................................ 26
3.5 Testing......................................................................................................................... 29
4. BANDWIDTH EXTENSION OF TELEPHONE SPEECH.............................................. 31
4.1 Linear estimation method for bandwidth extension.................................................... 31
4.2 Proposed algorithm for bandwidth extension ............................................................. 37
4.2.1 Residual extension ............................................................................................. 38
4.2.2 Envelope extension ............................................................................................ 40
4.2.3 Classification of speech frames.......................................................................... 43
4.3 Evaluation and results ................................................................................................. 48
4.3.1 Test material....................................................................................................... 49
4.3.2 Objective measures ............................................................................................ 49
4.3.3 Examples............................................................................................................ 57
5. CONCLUSIONS................................................................................................................ 60
BIBLIOGRAPHY................................................................................................................... 63
VITA
LIST OF FIGURES
Figure 2.1. Three components of an ALD ................................................................................ 4
Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech. ..... 11
Figure 2.3. Architecture of bandwidth extension systems...................................................... 12
Figure 2.4. Residual extension by nonlinear distortion........................................................... 14
Figure 2.5. Envelope extension by codebook mapping. ......................................................... 16
Figure 3.1. Structure of a Bluetooth stack............................................................................... 22
Figure 3.2. Architecture of the phone adapter......................................................................... 24
Figure 3.3. Hardware design. .................................................................................................. 25
Figure 3.4. Phone adapter prototype. ...................................................................................... 26
Figure 3.5. Emulating the Headset Profile.............................................................................. 27
Figure 3.6. Message flow. ....................................................................................................... 28
Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence
spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum
folding. .............................................................................................................................. 35
Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm. ................. 36
Figure 4.3. Spectrum folding. ................................................................................................. 38
Figure 4.4. Envelope extension. Markers indicate LSF values............................................... 40
Figure 4.5. Effect of artificial dispersion. ............................................................................... 42
Figure 4.6. IS distance comparison of the algorithms............................................................. 51
Figure 4.7. LLR measure comparison of the algorithms. ....................................................... 53
Figure 4.8. Frequency-domain SNR comparison of the algorithms. ...................................... 55
Figure 4.9. Classification performance. .................................................................................. 56
Figure 4.10. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated
wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed
algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with
HMM................................................................................................................................. 58
Figure 4.11. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated
wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed
algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with
HMM................................................................................................................................. 59
LIST OF TABLES
Table 4.1. Classification of [6] algorithm. ............................................................ 34
Table 4.2. Adjusted thresholds used in B. .......................................................... 51
CHAPTER ONE
INTRODUCTION
Hearing-impaired people, including hearing aid users and cochlear implant (CI) users, often
have difficulty talking through telephones. The intelligibility of telephone speech is
considerably lower than the intelligibility of person-to-person speech. This degradation
results mainly from the following three factors:
1. Lack of visual cues. In a person-to-person conversation, a hearing aid or CI user
often relies on lip-reading or other visible cues to help understand the other person.
When talking on the phone, the audio signal is the only available source of
information.
2. Loss of high-frequency information. Telephone speech is band-limited to 300Hz-
3400Hz. The spectrum above 3.4kHz, present primarily in fricative consonants
such as 's', 'sh', and 'ts', is lost. This causes the muffled quality of telephone
sound, which does not affect normal-hearing people but greatly affects the
hearing impaired.
3. Additional noise introduced by the interaction between the phone and the hearing
aid. The electromagnetic coupling effect of the phone-handset circuit and the
hearing aid coil results in the feedback and amplification of background noise
[33]. A cellphone's electromagnetic emission is often picked up by a hearing aid
as a buzzing noise [3]. Also, the performance of a CI decreases when using
cellphones, and different CI processors are not compatible with certain kinds of
cellphones [28].
To address the third problem, phone adapters have been proposed to help the hearing
impaired in telephone conversation. They are one category of assistive listening devices that
route the audio signal to the hearing aid or CI and hence maximize the signal-to-noise ratio
(SNR) [32][34]. In this thesis, we propose a wireless phone adapter based on Bluetooth, a
recently introduced short-range wireless technology.
To address the second problem, algorithms can be designed to process telephone
speech to improve intelligibility [6][14][33]. Effort has been put into bandwidth extension
techniques that aim at recovering the lost consonants from narrow-band telephone
speech. In this thesis, we propose a linear estimation method based on line spectral
frequencies (LSF).
This thesis is organized as follows: Chapter 2 reviews existing assistive listening
devices and bandwidth extension methods; Chapter 3 proposes a
Bluetooth-based phone adapter to address the third problem; Chapter 4 proposes a bandwidth
extension algorithm to solve the second problem; Chapter 5 presents conclusions and future
work.
CHAPTER TWO
LITERATURE REVIEW
In this chapter, we provide a literature review on assistive listening devices (ALD), telephone
recognition by CI users, and speech bandwidth extension algorithms.
ALDs are a general category of devices used to help hearing-impaired people in
different applications. They have been developed using a wide range of modern technologies [3]
[19]. Phone adapters are a special group of ALDs that send the telephone audio directly to
the hearing aid or CI in order to minimize exposure to environmental noise.
Telephone usage is one of the major concerns of CI users. Studies [8][28] have shown
that a certain percentage of CI users do not feel comfortable using the telephone. Their
conversation quality is limited by various problems [16], and improvements are needed.
Bandwidth extension algorithms try to improve general intelligibility by doubling
the sampling rate of telephone speech and recovering the lost high-frequency information.
Sections 2.1, 2.2 and 2.3 below provide literature reviews on ALDs, CI telephone
comprehension, and bandwidth extension respectively.
2.1 Assistive Listening Devices
Assistive listening devices (ALD) aim at improving the quality of life of hearing-impaired
people. With the help of hearing aids or cochlear implants, hearing-impaired people are
usually able to have a person-to-person communication in a quiet environment. However,
when there is ambient noise or interference, the hearing-impaired people suffer much more
degradation than normal-hearing people. Therefore, they often need ALDs designed to pick
up audio signals from the desired source and minimize the undesired interferences.
To ensure the rights of hearing-impaired people, auxiliary services are required
according to the Americans with Disabilities Act (Public Law 336 of the 101st Congress),
which was enacted on July 26, 1990. Public services, operated by government or private
entities, must provide hearing-impaired people with service functionally equivalent to
that available to normal-hearing people [29]. The auxiliary services include "qualified
interpreters, assistive listening devices, notetakers, and written materials".
Figure 2.1. Three components of an ALD.
An assistive listening device is usually composed of three parts: a sound-pickup
component, a sound-generating component, and a transfer component that connects the
previous two. The sound-pickup component, most commonly a microphone, picks up audio
signal from a person, a TV, a stereo, or a telephone. The signal is routed to the ALD user by
hardwire or wireless technology, then is processed, amplified, and sent to the hearing aid or
the processor of the cochlear implant user.
Different applications have different needs and require different designs. No single
solution is optimal for all scenarios. Based on the method of sending the audio signal from
the sound-pickup component to the sound-generating component, assistive listening devices
fall into two categories: hardwired devices and wireless devices. The wireless devices can be
further classified into three categories: induction loop, frequency modulated (FM), and
infrared light, named after the transmission technologies used [3]. The different types of
ALDs are discussed in the following sections. More detail is provided for the wireless
devices.
2.1.1 Hardwired Devices
The obvious advantage of hardwired ALDs is that transferring sound over a cord is free of
electronic interference. However, for the same reason, they sacrifice mobility. For a
personal ALD, the user is confined to within a few meters of the sound source; for a large
assistive listening system installed in an auditorium, users are restricted to specific
seats.
A typical example of hardwired ALDs is a currently available phone adapter for
hearing-impaired people. The adapter plugs in between the phone-base and the phone-
handset, takes the speech signal out, and provides an audio output jack that can be connected
to the hearing aid or the CI processor. The user can listen through the adapter while still
talking into the phone-handset [32]. It avoids the sound degradation caused by the phone-
handset speaker and the environmental noise, and therefore provides the user with better
conversation quality. This adapter may not work with all phones, and since it is hardwired,
the cord length confines the user.
Another recently proposed ALD for CI users can also be classified as a hardwired device.
An in-the-ear microphone is connected to the CI input. The user simply holds the
phone-handset as normal, and the in-the-ear microphone picks up the sound [34]. It is small
and convenient, compatible with all phones, and environmental noise is partially
blocked because the phone-handset itself serves as a seal.
2.1.2 Induction Loop Devices
In induction loop ALDs, audio signal received from the desired source is amplified and then
sent to a wire loop that surrounds the room. The alternating current, carrying the signal,
generates an alternating magnetic field inside the room. A coil on the user side picks up this
magnetic field, and an inductive current is generated inside the coil, carrying the desired
audio signal. The coil can be the input to the hearing aids or CI processors.
The advantage of induction loop ALDs is their simple installation. For hearing aids
with a telecoil switch, the ALD requires nothing extra from the user: simply walk into the
room, switch to the "telecoil" setting, and the user is ready to listen through the ALD. However,
induction loop devices are vulnerable to electromagnetic interference. Electrical installations
and wires in the room, or another induction loop ALD nearby, are all possible sources of
interference. For the above reasons, induction loop ALDs are typically used in large public
facilities, such as classrooms and auditoriums.
2.1.3 Frequency Modulation (FM) Devices
FM ALDs use frequency modulation technology as the transmission method. The frequency
variation around the carrier frequency represents the audio information. The user uses a
receiver to demodulate the radio frequency signal and retrieve audio signal, which is then
sent to a hearing aid or a CI processor.
The advantages of FM ALDs are their portability, large coverage, and the ability to
broadcast several signals in different channels at the same time. On the other hand, they are
more complicated and expensive than induction loop ALDs; they are subject to interference
of radio signals, which may come from radio broadcast or another FM ALD nearby; there is
also a lack of privacy. FM ALDs are widely used as both personal ALDs and large assistive
listening systems.
2.1.4 Infrared Light Devices
Infrared light technology is similar to FM except that the signal carrier, infrared light, is
directional and it cannot penetrate opaque objects (such as walls).
This property brings the obvious advantage of privacy, as the signal is limited to
inside the room. Also, because an infrared light ALD does not have interference coming from
adjacent rooms and is resistant to radio interference, it provides a higher audio quality than
FM devices. However, infrared light ALDs are the most complicated and expensive of the
three kinds; their high power consumption usually cannot be supported by batteries, so they
are not portable; and the receiver has to avoid sunlight, as the infrared component of
sunlight can severely interfere with the desired signal. Because of these characteristics,
infrared light ALDs are mostly used for home applications.
In Chapter 3, a Bluetooth-based phone adapter is proposed. It belongs to the group of
FM ALDs. By taking advantage of Bluetooth technology, it overcomes the
shortcomings of traditional FM ALDs and offers better sound quality and resistance to
environmental noise.
2.2 Telephone Recognition by CI Listeners
An important indicator of a CI user's quality of life is whether he or she is able to carry
on a conversation over the telephone, in the absence of lip-reading cues. Telephone usage is part
of many CI rehabilitation programs [31] [2]. According to the survey result in [8], 51% of
Ineraid CI implantees initiate calls and 66% of them answer calls in daily life. In another
recent questionnaire reported in [28], 51 out of 61 Finnish respondents used telephones.
However, telephone competence shown in these studies was mostly limited to familiar callers
and familiar topics.
Different tests have been designed to evaluate the telephone ability of CI users. In [4]
(1985), one of the earliest studies on this topic, one CI implantee with high performance was
chosen to listen to Central Institute for the Deaf (CID) sentences over telephone and to repeat
them. She obtained 21% of keywords correctly and 47% when listening twice. In a more
systematic study reported in [7] (1989), subjects were tested with sentences sent through an
extension-telephone call, a local call and a long-distance call. The results showed that 23% of
their patients had a significant degree of telephone ability, and that a 50% or higher score in
CID sentences test was a good indicator of telephone competence. The second conclusion
was also confirmed by [8]. Another more recent study, [17] (1998), reported that 68% of the
adult Clarion CI users were able to understand at least half of the sentences over the phone,
and half were able to understand at least 75% of the sentences, 12 months post implantation.
Tests designed for prelingually deaf children with CIs differ from the above tests
for postlingually deaf adults, as these children have no telephone experience and a
limited vocabulary. In [2], six children were tested with monosyllables, 2-syllable words and
3-syllable words presented through telephone. The average percentage of correct responses
ranged from 50% to 83% for different materials, and some of the children began to use
telephones after this training program. A larger-scale study was reported in [31], which tested
150 prelingually deaf children ranging from 1 year to 5 years after implantation. This
hierarchical test started from recognizing rings and went up to carrying open conversation
with unfamiliar callers. The performance of the children increased significantly over time
and approached the level of normal-hearing children after 5 years.
Although the above results are encouraging, most CI users are not able to have an
interactive conversation with unfamiliar callers about unfamiliar topics, and they describe the
telephone speech quality as weak, hollow, tinny, echoing, fuzzy, or otherwise distorted
[8]. A detailed survey about telephone problems was done in [16]. It collected
information mainly from hearing-aid users, but the results were also applicable for CI users.
Background noise was a problem for 94% of the respondents; 76% of them thought
telephone speech was too soft; 66% of them reported lack of clarity, and this could not be
solved by amplification. 70% of the subjects found coupling a hearing aid with a telephone to
be problematic due to feedback effects, and nearly half of them preferred not to use their
hearing aids with telephones. The respondents also showed a strong desire for improvements
of ALDs.
In [28], the compatibility between CIs and cellular phones was explored. Digital
phones generate a broad-spectrum radio signal, which appears to CI processors as noise. The
test results showed that Nucleus CI systems are not compatible with GSM phones, while
Combi 40+ systems are compatible with the GSM phones tested.
There are currently several phone adapters and ALDs for CI users. In [19], three
commercial FM ALD products were evaluated with CI users in a noisy environment. All
subjects demonstrated much higher recognition performance with the help of FM ALDs.
Two widely used phone adapters are TEL-001 (Williams Sound) and TLP-102 (DynaMetric).
Both of them are hardwired ALDs. They plug into the handset jack of a normal telephone,
and provide a direct access to the telephone audio in the form of a mono-plug, which can be
fed into a CI processor. Thus environmental noise is excluded, and the feedback problem
is avoided. A detailed description of TEL-001, as well as an in-the-ear microphone solution
proposed in [34], can be found in Section 2.1.1.
Speech intelligibility can also be improved for hearing-impaired people by using
signal-processing techniques. Special strategies can be designed to compensate for their
hearing loss. In [33], a frequency shaping method was proposed. It amplified the audio signal
frequency-selectively based on the knowledge of hearing loss at different frequencies. When
evaluated in an intelligibility test, the algorithm combining frequency shaping and frequency-
selective amplitude compression achieved the greatest speech enhancement: a 15-30% increase
in recognition.
Different from the above user-dependent signal processing strategies, bandwidth
extension algorithms improve the general intelligibility by recovering lost high-frequency
information. An overview of bandwidth extension algorithms is given next.
2.3 Speech Enhancement by Bandwidth Extension
The telephone speech signal in current telecommunication networks is band-limited to 300-
3400Hz, while natural speech spans roughly 50Hz to 8000Hz. Figure 2.2 shows
the spectrograms of the narrow-band signal and the wide-band signal. The loss of
information in [50, 300] Hz and [3400, 8000] Hz range causes a muffled effect. For normal-
hearing people, the narrow-band telephone signal is already good enough for intelligibility,
and they prefer a wide-band signal only because it sounds more natural. For the hearing
impaired, the loss of high-frequency consonants is one of the main reasons for the difficulty
in using telephones.
Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech.
Due to the redundant nature of human speech, the lost information can be
recovered, at least partially, from the narrow-band signal. Algorithms, such as [6] [9] [14],
have been proposed to solve this problem. Such algorithms can be implemented at the user
end and therefore require no changes to the telephone network. Also, in [35], a coding
method is proposed to recover wide-band speech accurately at the expense of additional low-
bitrate transmission of side-information.
2.3.1 Fundamentals of Bandwidth Extension
The low band of speech, 50-300Hz, contributes mainly to speech quality and little to
intelligibility. In [35], the low-band signal is represented by two sinusoids. In [21], this is
done by spectral envelope extension and inserting sinusoids into the residual. The
performance of both methods depends on the accuracy of pitch detection, which is sometimes
unreliable.
In this thesis, we mainly focus on recovering the high-frequency band, i.e. 3400-
8000Hz, of telephone speech. The typical architecture of a bandwidth extension system is
shown in Figure 2.3.
Figure 2.3. Architecture of bandwidth extension systems.
The whole algorithm can be viewed as two separate processes: residual extension
and spectral envelope extension. An LPC (linear predictive coding) analyzer extracts
the spectral envelope from the input narrow-band signal. The residual extension module
processes the resulting residual signal, while the envelope extension module predicts the
wide-band spectral envelope, based on the 300-3400Hz portion. The desired signal is then
synthesized by using the wide-band residual and the wide-band LPC coefficients.
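This two-branch structure can be sketched as follows. This is a minimal single-frame illustration, not the thesis implementation: the function names are hypothetical, and the two extension steps are deliberately naive placeholders (zero-insertion residual, unchanged envelope) standing in for the real methods discussed in the following sections.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """LPC coefficients [1, a_1, ..., a_M] via autocorrelation + Levinson-Durbin."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err   # reflection coefficient
        a[:i + 1] += k * a[:i + 1][::-1]      # order-update of the predictor
        err *= 1.0 - k * k                    # remaining prediction-error energy
    return a

def extend_frame(frame_nb, order=10):
    a_nb = lpc(frame_nb, order)                # narrow-band spectral envelope
    residual = lfilter(a_nb, [1.0], frame_nb)  # prediction error e(n), Eq. (2.1)
    # Placeholder residual extension: zero insertion (spectrum folding).
    residual_wb = np.zeros(2 * len(residual))
    residual_wb[::2] = residual
    a_wb = a_nb                                # placeholder envelope "extension"
    return lfilter([1.0], a_wb, residual_wb)   # wide-band synthesis filter
```

The analysis filter whitens the frame into a residual, and the synthesis filter reimposes a spectral envelope on the (extended) residual; the quality of the output rests entirely on how well the two extension steps are done.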
Residual extension is discussed in Section 2.3.2. Two main methods of envelope
extension, codebook mapping and linear estimation, are discussed in Sections 2.3.3 and 2.3.4,
respectively.
2.3.2 Residual Extension
A short frame of speech signal can be modeled as an autoregressive (AR) random process.
The residual signal is the linear prediction error sequence, defined by the following equation:
e(n) = s(n) − Σ_{k=1}^{M} a_k · s(n − k)                                    (2.1)
where a_1, a_2, ..., a_M are the LPC coefficients, e(n) is the residual signal, and s(n) is the
speech signal.
The residual signal has a flat spectrum like white noise. In a voiced frame, such as
vowels and semi-vowel consonants, the residual has periodicity. This appears as
harmonic peaks in addition to the flat noise-like spectrum. These peaks occur at multiples of
the pitch, the fundamental frequency of the speaker's voice.
Therefore, the task of the residual extension module is to double the sampling rate,
from 8kHz to 16kHz, while keeping the whole spectrum flat. If there are harmonics in the
narrow-band residual, the wide-band residual should also have the harmonic structure. There
are two methods in common use that accomplish that:
1. Nonlinear distortion method. As shown in Figure 2.4, the narrow-band residual is
first upsampled by interpolation and then fed into a nonlinear function. The distorted
signal will have the desired bandwidth and harmonic structure over the whole
spectrum. After the whitening filter, the spectrum is flattened and the wide-band
residual is achieved. A popular nonlinear function is given below:
y(t) = [(1 − α)·x(t) + (1 + α)·|x(t)|] / 2                                  (2.2)
where x(t) is the input signal, y(t) is the distorted output signal, and α is a parameter
between 0 and 1 [20]. When α = 1, the function reduces to the absolute-value function, which is
used in [35] and achieves good results.
Figure 2.4. Residual extension by nonlinear distortion.
2. Spectrum folding method. This time-domain method proposed in [20] is easy to
implement. The upsampling of the narrow-band residual is done by inserting zeros
instead of interpolating. This is equivalent to folding the spectrum of 0-4000Hz to
4000-8000Hz in the frequency domain. Since the low-frequency spectrum is flat and
has harmonics, the resulting wide-band residual will also have a flat spectrum and
harmonics in both the low-frequency part and the high-frequency part. One drawback
of this method is that the harmonic structure is broken at 4kHz. A possible solution is
to change the sampling rate to a multiple of the pitch before performing the folding,
but this requires accurate pitch detection. Another disadvantage is that harmonics in a
real wide-band residual should have descending amplitudes, but in the folding
method, the harmonics at highest frequencies are the reflection of the harmonics at
lowest frequencies and therefore have the same amplitudes. Fortunately, these details
of the residual signal do not affect speech intelligibility substantially, and the
spectrum folding method is widely used.
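Both methods above can be sketched in a few lines of Python. The following is an illustrative sketch, not the thesis implementation: the function names and the toy signal are invented, and the sign convention in `distort` follows the reconstructed form of Equation 2.2. A small DFT check confirms the imaging property of zero insertion: bin k and bin k+N of the upsampled signal have the same magnitude.

```python
import cmath

def fold(x):
    """Upsample by 2 via zero insertion: the spectrum of x is imaged into the high band."""
    y = [0.0] * (2 * len(x))
    y[::2] = x          # the odd-indexed samples, in 1-based terms, carry the data
    return y

def distort(x, alpha):
    """Nonlinear distortion of Eq. 2.2; alpha = 1 reduces to the absolute value."""
    return [((1 - alpha) * s + (1 + alpha) * abs(s)) / 2 for s in x]

def dft_mag(x):
    """Magnitude of the DFT, computed naively (fine for a toy-sized signal)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
            for k in range(n)]

x = [0.5, -1.0, 0.75, 0.25, -0.5, 1.0, -0.25, 0.0]   # toy narrow-band residual
y = fold(x)
mag = dft_mag(y)
# Zero insertion repeats the spectrum: bin k and bin k + N have equal magnitude.
print(all(abs(mag[k] - mag[k + len(x)]) < 1e-9 for k in range(len(x))))
```

The check holds because inserting zeros leaves the DFT sum unchanged except for a halved effective frequency axis, so Y(k) = X(k mod N).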
2.3.3 Codebook Method
Codebook mapping is a popular method to achieve spectral envelope extension [5][9][14].
For this application, the codebook, as is shown in Figure 2.5, consists of two columns. The
first column contains vectors composed of spectral parameters extracted from the narrow-
band signal, while the second column contains vectors extracted from the corresponding
wide-band signal. When an input frame comes in, the parameters are extracted from it and
compared with the vectors in the first column. By vector quantization, the vector closest to
the input parameters is found, and the corresponding wide-band parameters are taken from
the second column to generate the extended spectral envelope.
The codebook is generated from a large training database of speech. Eligible
parameters to be used in the codebook are LPC coefficients, reflection coefficients, LSF,
cepstral coefficients, etc.
Figure 2.5. Envelope extension by codebook mapping.
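The table lookup of Figure 2.5 can be sketched as follows. This minimal Python version uses a Euclidean nearest-neighbor search over a two-entry toy codebook; the parameter vectors and the function name are invented for illustration.

```python
def extend_envelope(narrow_vec, codebook):
    """codebook: list of (narrow_params, wide_params) pairs learned in training.
    Returns the wide-band parameters paired with the closest narrow-band entry."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, wide = min(codebook, key=lambda pair: dist2(pair[0], narrow_vec))
    return wide

# Toy two-entry codebook (hypothetical parameter vectors):
codebook = [
    ([0.2, 0.5], [0.2, 0.5, 0.9, 1.3]),   # e.g., a vowel-like envelope
    ([0.8, 1.4], [0.8, 1.4, 2.1, 2.6]),   # e.g., a consonant-like envelope
]
print(extend_envelope([0.25, 0.55], codebook))   # closest to the first entry
```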
An obvious limitation of the codebook mapping method is that the number of possible
outputs is limited by the codebook size. Also, when the parameters of two narrow-band
frames belong to the same group in vector quantization, the corresponding wide-band frames
do not necessarily belong to the same group. The probability of such mismatches increases
as the codebook size increases [21].
To address the above problems, improved versions of codebook mapping were
proposed:
1. Codebook plus interpolation. When a set of input parameters comes in, a
number of closest codebook items are found. The output is computed as the
weighted sum of the corresponding wide-band parameter-vectors, based on a
certain statistical model [5].
2. Multiple codebooks. Speech frames are classified into several groups, and one
codebook is trained and used separately for each group. In [9], two codebooks
were trained and used for voiced and unvoiced frames, and the performance was
found to be superior to other codebook methods.
3. Statistical codebook searching. In order to reduce mismatching, when making
the decision on an upcoming frame, the information from a number of previous
frames is taken into consideration to find the codebook item with the highest
probability. In [14], a codebook search method based on hidden Markov models
(HMM) was proposed.
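The first variant above (codebook plus interpolation) can be sketched as follows. This toy Python version weights the k nearest wide-band vectors by inverse squared distance, which stands in for the statistical weighting of [5]; the codebook entries and function name are invented.

```python
def extend_interp(narrow_vec, codebook, k=2):
    """Weighted sum of the wide-band vectors of the k nearest codebook entries."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(codebook, key=lambda pair: dist2(pair[0], narrow_vec))[:k]
    # Inverse-distance weights (small epsilon guards against an exact match):
    weights = [1.0 / (dist2(n, narrow_vec) + 1e-9) for n, _ in nearest]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [sum(w * wide[j] for w, (_, wide) in zip(weights, nearest)) / total
            for j in range(dim)]

codebook = [([0.0], [0.0, 0.0]), ([1.0], [1.0, 2.0]), ([4.0], [4.0, 8.0])]
out = extend_interp([0.5], codebook, k=2)   # blends the two closest entries equally
```

Because the input sits halfway between the two closest entries, the output is their average, illustrating how interpolation escapes the finite-output limitation of plain mapping.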
2.3.4 Linear Estimation Method
Instead of codebook mapping, spectral envelope extension can also be done by linear
estimation. The vector x, the set of parameters representing the narrow-band spectral
envelope, is first extracted from an input signal frame. Then the corresponding vector y,
representing the wide-band envelope, is calculated by feeding x into a group of linear
filters. This is shown in the following equation:

y = M · x    (2.3)

where M is the matrix composed of filter parameters. The output spectral envelope is then
generated based on the vector y.
If we look at each element of y and the corresponding row of M, Equation 2.3 can be
viewed as N separate equations, as follows:

y(k) = w(k) · x,    k = 1, 2, …, N    (2.4)

where N is the dimension of y and x; y(k) is the k-th element of y; w(k) is the k-th row of
M. To generate M from a training database, a large number of (x, y) pairs are extracted,
and the optimal w(k) is found by minimizing the following cost function:

E(k) = Σ_{i=1..L} [ y(k, i) − w(k) · x(i) ]²,    k = 1, 2, …, N    (2.5)

where L is the size of the training data; x(i) is the input vector x in the i-th pair of training
data; y(k, i) is the k-th element of the output vector y in the i-th pair of training data.

From Wiener filtering theory, the optimal tap-weight vector is given below:

w_opt(k) = (X^T X)^{-1} X^T Y(k),    k = 1, 2, …, N    (2.6)

where X and Y are composed of training data, each row consisting of one sample of x or
y, respectively; Y(k) is the k-th column of matrix Y.

Combining the N equations in Equation 2.6, the final training solution is given
below:

M = (X^T X)^{-1} X^T Y    (2.7)
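The training solution of Equation 2.7 is an ordinary least-squares fit, and it can be verified numerically. The following Python sketch adopts a row-vector convention (each row of X and Y is one training pair), generates Y from a known linear map, and checks that the formula recovers it; the hand-rolled 2×2 solver and toy data are our own.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inv2(A):
    """Inverse of a 2x2 matrix; enough for this two-parameter toy example."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def train(X, Y):
    """Equation 2.7: M = (X^T X)^{-1} X^T Y (row-vector convention)."""
    Xt = transpose(X)
    return matmul(matmul(inv2(matmul(Xt, X)), Xt), Y)

# Toy training data generated by a known map, so train() should recover it:
M_true = [[2.0, -1.0], [0.5, 3.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [3.0, 1.0]]
Y = matmul(X, M_true)        # each row: y_row = x_row · M_true
M_hat = train(X, Y)
```

With noiseless data the least-squares solution reproduces the generating matrix exactly, up to floating-point error.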
The advantage of the linear estimation method is that it requires much less memory and
computation than the codebook mapping method, which is a desirable property when
implementing the system. One disadvantage is that the solution may yield invalid values
representing an LPC filter with an unstable impulse response. Therefore, special adjustments
have to be added to avoid invalid results, and this might introduce artifacts. Also, since we
try to use a linear model to describe the nonlinear relation between narrow-band and
wide-band parameters, a certain degree of distortion can be expected.
To compensate for the nonlinearity, speech classification techniques are combined
with linear estimation, as proposed in [23] and [6]. This is also known as piecewise linear
estimation. Speech frames are classified into several groups, and estimation matrices are
trained and used for each group separately.
The key factor in the linear estimation method is the choice of parameters to be used.
Various speech parameters, such as LPC coefficients, LSF, and reflection coefficients, can
represent the spectral envelope. In [9], a linear estimation using sub-band log energies is
compared with a codebook mapping method, and shows higher spectral distortion, even with
a classification into eight groups.
LSFs are good candidates. Several linear estimation algorithms using LSFs have been
proposed. In [21], the envelope extension by estimating LSFs was compared with a
codebook-mapping counterpart using both an objective distortion measure and a subjective
evaluation, and showed a better performance. In [6], the number of LSFs, i.e., the order of
the LPC analysis, is doubled for the wide-band signal. The lower half of the expanded-signal
LSFs is calculated by dividing the narrow-band LSF values by 2. This is equivalent to
copying the narrow-band spectrum to the low band of the output signal. Thus the
transparency of the system in the 300-3400 Hz band is guaranteed.
CHAPTER THREE
BLUETOOTH-BASED PHONE ADAPTER
As discussed in Chapter 2, one of the main difficulties hearing-impaired people face when
using telephones is the noise introduced by the interaction between the phone and the
hearing aid. Thus phone adapters are needed to route the audio signal directly to the hearing
aid or CI processor.
This chapter proposes a wireless phone adapter, which falls into the category of FM
assistive listening devices. This adapter is based on Bluetooth technology, which is a new
wireless transmission standard. The favorable features of this technology make the adapter
superior to traditional ALDs.
3.1 Introduction to Bluetooth
Wireless technology has dramatically changed the way people interact with one another and
receive information. Bluetooth, a short-distance wireless communication standard, aims at
replacing cables and therefore making the world truly wireless. It defines a universal radio
interface, through which devices within 10 meters can form short-distance ad hoc networks.
By using the dynamic Bluetooth links between these mobile devices, a large number of new
products and services will become possible.
The physical carrier of Bluetooth connections is the 2.4-2.5 GHz band, which is
available for public use in most countries. This band is divided into 79 channels of 1 MHz
width, and each channel is divided into 625-µs time slots. The modulation scheme
used for one time slot of one channel is based on binary frequency shift keying (BFSK). A
Bluetooth link, with one side called "the master device" and the other side called "the slave
device", uses one channel in each time slot and jumps to another channel in the next time
slot. The sequence of channels to be used is a pseudo-random sequence decided by the
master device. The above scheme is called Frequency Hopping Code Division Multiple
Access (FHCDMA). The device that initiates the ad hoc network becomes the master, while
the other devices in the network are slaves. Two sides of a link alternately transmit and
receive, i.e., there is only one-way traffic in one time slot. There are two kinds of links:
synchronous connection-oriented (SCO) links and asynchronous connectionless (ACL) links.
An SCO link is composed of evenly spaced pairs of time slots in the hopping sequence, with
a 64Kb/s bit-rate; ACL links use the slots not reserved by SCO links. One Bluetooth link can
support ACL links and up to 3 SCO links at the same time, and its theoretical maximum bit-
rate is 1Mb/s [12][29].
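To give a feel for the slot-and-hop structure just described, the Python sketch below simulates a master and a slave alternating transmission over a shared pseudo-random channel sequence. This is illustrative only: the hop generator here is a seeded PRNG, not the actual Bluetooth hop-selection kernel (which derives the sequence from the master's address and clock), and the slot bookkeeping is simplified.

```python
import random

CHANNELS = 79          # 1-MHz channels in the 2.4 GHz ISM band
SLOT_US = 625          # slot length in microseconds

def hop_sequence(seed, n_slots):
    """Pseudo-random channel per time slot (a stand-in for the real kernel)."""
    rng = random.Random(seed)
    return [rng.randrange(CHANNELS) for _ in range(n_slots)]

def schedule(seq):
    """Both sides alternate: master transmits in even slots, slave in odd slots."""
    return [("master" if i % 2 == 0 else "slave", ch, i * SLOT_US)
            for i, ch in enumerate(seq)]

seq = hop_sequence(seed=0x9E8B33, n_slots=6)
for who, ch, t in schedule(seq):
    print(f"t={t:>5} us  {who:6s} TX on channel {ch}")
```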
To ensure interoperability between Bluetooth devices, protocol layers and application
profiles are defined in the Bluetooth Specification. Figure 3.1 illustrates the structure of a
Bluetooth protocol stack. The layers above and including HCI (Host Controller Interface) can
be viewed as software layers, while the rest of the stack can be viewed as hardware layers.
Baseband and Link Manager layers implement the transport actions described in the previous
paragraph, and provide a command interface, the HCI layer. The L2CAP (Logical Link
Control and Adaptation Protocol) layer divides large packets from higher layers into small
packets for lower-layer transmission and reassembles received small packets into large
packets intelligible to higher layers; the L2CAP layer also supports multiple applications by
assigning logical channels. Thus, TCS (Telephony Control Specification), SDP (Service
Discovery Protocol) and RFCOMM are unaware of physical communication details. The
RFCOMM layer further emulates a serial port, so that many conventional applications can be
used on it with no or minor changes. Application profiles for different scenarios are also
included in the Bluetooth Specification to guide implementation, as two devices following
the same profile have guaranteed compatibility [22].
Figure 3.1. Structure of a Bluetooth stack.
TCS = Telephony Control Specification; SDP = Service Discovery Protocol;
L2CAP = Logical Link Control and Adaptation Protocol;
HCI = Host Controller Interface
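The segmentation-and-reassembly role of L2CAP described above can be illustrated with a short sketch. The two-byte length plus channel-id header below mirrors the basic L2CAP header layout, but the MTU value, the function names, and the omission of all control signaling are simplifications of our own.

```python
import struct

MTU = 16   # toy lower-layer payload limit, in bytes

def segment(channel_id, payload):
    """Split one upper-layer packet into MTU-sized fragments with a small header."""
    header = struct.pack("<HH", len(payload), channel_id)   # length, then CID
    data = header + payload
    return [data[i:i + MTU] for i in range(0, len(data), MTU)]

def reassemble(fragments):
    """Concatenate fragments and strip the header, checking the declared length."""
    data = b"".join(fragments)
    length, channel_id = struct.unpack("<HH", data[:4])
    payload = data[4:4 + length]
    assert len(payload) == length, "incomplete packet"
    return channel_id, payload

cid, msg = reassemble(segment(0x40, b"hello from a higher layer"))
```

The channel id is what lets L2CAP multiplex several applications over one link, as the text notes.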
Currently available products equipped with Bluetooth technology include cellphones,
phone adapters, headsets, PC cards, modems, printers, and printer adapters, which are mainly
used to replace cables. A more important potential of Bluetooth is auto-synchronization.
Personal mobile devices, such as cellphones, PDAs and laptops, can form a mobile network
and always keep updated with each other. Wherever the user goes, the personal devices can
automatically find the local Bluetooth-enabled devices and make use of the information and
services provided. It is projected that in the near future, by 2005, when there will be millions
of Bluetooth-enabled products, the auto-synchronization function will bring great benefit to
users. Places such as stores, cinemas and airports will only need a Bluetooth-enabled
information access point to do all their business.
The key factor for the success of this technology is interoperability. Manufacturers
have to make their products comply with the Bluetooth Specification so that they can
communicate with other products. A minimum requirement is to pass the qualification
program administered by the Bluetooth Special Interest Group (SIG).
3.2 Phone Adapter Design
The phone adapter proposed in this thesis is an application of Bluetooth wireless technology
for hearing impaired people. The Bluetooth link is used to transmit the audio signal. The
architecture of the proposed adapter is illustrated in Figure 3.2.
A pair of Bluetooth transceivers forms the wireless link, and each of them is
connected to a host controller running the software protocol stack. The host controller can be
a PC, a microcontroller, or a digital signal processor (DSP). The telephone signal is
connected to the duplex audio interface of the master device. The audio output of the slave
device is connected to the hearing aid or the CI processor, while the input is connected to a
lapel microphone.
Figure 3.2. Architecture of the phone adapter.
The slave device is first initialized to active slave mode, waiting for the connection
request from the master. When a telephone call comes in, or the user makes a call, the master
device sends out paging messages to find the slave device, and initiates an SCO link. After the
connection is confirmed by both sides, the user can talk through this Bluetooth link without
the need to hold the telephone handset or the need to connect the hearing aid or CI processors
directly to the telephone jack. Since the audio signal is directly transmitted from the phone to
the hearing aid or CI processor, environmental noise is eliminated, and the user will be able to
enjoy high speech quality even in extremely noisy situations (e.g., in a crowded restaurant,
in a car).
3.3 Hardware Design
A prototype system was developed in this thesis. In this prototype, a pair of Ericsson's
Bluetooth Starter Kits (EBSK) was used for the transceiver hardware. The PC works as the
host controller of the master device, which is stationary and connected to the telephone. A
Motorola DSP56309 processor is used as the host controller of the slave device and provides
portability of the user side.
Figure 3.3. Hardware design.
Figure 3.3 shows the block diagram of the hardware design of the slave device. The
host I/O port of the DSP56309 Evaluation-Module (EVM) board is programmed to send HCI
commands to EBSK. A sequence of 9 HCI protocol commands is implemented in this device
in assembly language. Their function is to reset the EBSK, to set basic transmission settings,
and to put the EBSK in an active slave mode. In order for this prototype to be stand-alone,
the assembly program is written into the flash memory on the EVM board and is executed
when the DSP is reset. To meet the electrical requirements of the host interface of the EBSK,
a signal amplifying and shifting circuit was designed based on an LM318N chip (National
Semiconductor), and a voltage converter circuit was built upon a LMC7660IN chip (National
Semiconductor) to provide a negative voltage for the amplifier. Figure 3.4 shows a
photograph of the portable slave-side device.
Figure 3.4. Phone adapter prototype.
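To make the HCI command sequence concrete, the sketch below assembles the byte layout of an HCI command packet as carried over a UART transport (packet-type indicator 0x01, a 16-bit opcode composed of OGF and OCF in little-endian order, then a parameter-length byte). The example builds HCI_Reset; treat this as a sketch of the specification's framing rather than the exact bytes our assembly program emits.

```python
def hci_command(ogf, ocf, params=b""):
    """UART transport: 0x01 indicator, opcode = (OGF << 10) | OCF (little-endian),
    one parameter-length byte, then the parameters themselves."""
    opcode = (ogf << 10) | ocf
    return bytes([0x01, opcode & 0xFF, opcode >> 8, len(params)]) + params

# HCI_Reset lives in the Controller & Baseband group (OGF 0x03, OCF 0x0003).
reset = hci_command(0x03, 0x0003)
print(reset.hex())   # 01030c00
```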
3.4 Software Design for Wireless Link
The ultimate goal of this project is not only to design a phone adapter, but an ALD that can
receive audio signals from all Bluetooth-enabled sources, such as TVs, stereos, and
computers. An audio source with a Bluetooth transceiver should be able to find this device
with a function description and setup an SCO link, all through the procedures defined in the
Bluetooth Specification. To achieve this interoperability, the host controller of our device
needs to support the L2CAP protocol, the RFCOMM protocol, the SDP protocol, and one
application profile defined in the specification.
The Headset Profile is most similar in function to our phone adapter. It defines the
procedure of setting up a Bluetooth audio link between an audio-gateway device and a
headset device. In our phone adapter, the telephone-side device corresponds to the audio
gateway, and the user-side device corresponds to the headset. The transceiver hardware is
still a pair of EBSK, but the host controllers of both sides are two computers with Ericsson
Bluetooth PC Reference Stack loaded. This software stack is a COM-server (Component
Object Model) in the form of an executable file. It contains HCI, L2CAP, RFCOMM, and
SDP layers, and provides a programming interface [10]. Application programs, written in
C++, communicate with the protocol layers by sending commands and receiving event-
messages. These programs emulate the operations of the audio gateway and the headset, as
defined in the Bluetooth Specification.
Figure 3.5. Emulating the Headset Profile.
The structure of the prototype system is shown in Figure 3.5. For practical reasons,
the user-side device should be portable and cannot be hosted by a computer. Here we make
the assumption that the software portion of this prototype can be implemented using a
microcontroller or a DSP.
Figure 3.6. Message flow (adapted from [10][29]).
Figure 3.6 shows the message flow of setting up a headset link. The application
program first initializes itself by registering at the protocol layers, and starts by writing the
SDP service record as a headset. When the remote audio gateway inquires the function
description, SDP answers with the information written by the program. After answering the
L2CAP and RFCOMM connection requests properly, a virtual serial-port link is established
between the two devices. "RING", an AT command defined in [13], is sent from the audio
gateway to the headset, and the program answers with "AT+CKPD=200", another AT
command [11], which indicates that the incoming call is accepted. Then the audio gateway
initiates the SCO link that carries the speech signal, and the process is finalized.
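The RING/AT+CKPD handshake can be emulated in a few lines. The Python sketch below models only the headset side's reply logic over the virtual serial port; the CRLF framing and the function name are our assumptions, while the two AT strings come from the profile exchange described above.

```python
def headset_reply(line):
    """Headset-side response over the emulated serial link (RFCOMM)."""
    line = line.strip()
    if line == "RING":
        # Accept the incoming call: the button-press AT command of the Headset Profile.
        return "AT+CKPD=200\r\n"
    return None   # other unsolicited messages are ignored in this sketch
```

For example, `headset_reply("RING\r\n")` returns `"AT+CKPD=200\r\n"`, after which the audio gateway would proceed to set up the SCO link.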
The application programs were developed based on sample programs provided with
the Ericsson stack. Those programs were modified for our application. When both host
controllers used our emulating programs, the EBSKs completed the whole process shown in
Figure 3.6.
3.5 Testing
In order to evaluate the effectiveness of the wireless transmission on the quality of audio
signal, three CI users were invited to talk through the phone-adapter prototype. All users are
fitted with the MED-EL CIS-LINK processor. The three CI subjects were using their daily
MED-EL processor with the CIS strategy running at 1000-2000 pulses/second. The audio I/O
of the portable device was split to two mono-jacks, one leading to the audio input of the CI
processor and the other leading to a microphone. The user listened through his CI processor
and talked into the microphone. The stationary side of the prototype was connected to different
audio sources, including a handset with a person talking, a sound-card jack of a computer,
and audio wires taken from a normal telephone.
Good quality was reported by the CI users with a person talking through the handset,
when both sides were within reasonable distance inside the lab (the lab is 7 meters long and 6
meters wide). When the user, holding the portable device, walked outside the room and
closed the door (which has a metal frame), the signal faded substantially.
In order to verify the interoperability of the software design, the user-side device (the
virtual headset) was tested with an Ericsson T28W cellphone coupled to a DBA10 adapter,
which supports the audio-gateway function in the Headset Profile. When the headset program
was tested with T28W+DBA10, the process hung at a minor step before the "RING"
message. The last message from the remote device was a modem-status-change event on the
virtual connection, and that message was not answered properly. This could be due to
compatibility problems between current Bluetooth products, and further work on this adapter
might need technical support from the manufacturer.
CHAPTER FOUR
BANDWIDTH EXTENSION OF TELEPHONE SPEECH
In Chapter 2, we have seen that both codebook-mapping algorithms and linear-estimation
algorithms have their advantages and disadvantages, and that linear estimation requires much
less memory and computation. This chapter proposes a linear estimation method based on
LSF parameters, combined with speech classification techniques to overcome drawbacks of
generic linear estimation. Section 4.1 provides a discussion about the algorithm proposed in
[6], which motivated the proposed method. Section 4.2 provides a detailed description of the
improved algorithm. Section 4.3 evaluates this algorithm using several objective measures.
4.1 Linear Estimation Method for Bandwidth Extension
In this section, we continue the discussion in Section 2.3. By analyzing the advantages and
disadvantages of the linear estimation method proposed in [6], we provide the theoretical
basis of the proposed algorithm of this thesis. The operation of linear estimation is shown in
Equation 2.3, which is reproduced here:
y = M · x    (4.1)

where the vector x is the set of parameters representing the narrow-band spectral envelope;
the vector y is the set of parameters representing the wide-band spectral envelope; and M is
the matrix composed of estimation parameters.
The choice of parameters is the critical factor in the linear-estimation performance.
Among the parameters that can describe the spectral envelope, LSF is a good candidate.
Initially proposed and proved in [27][30], it is a set of parameters equivalent to the LPC
coefficients. Its definition starts from the following functions:

A(z) = 1 + a_1 z^{-1} + … + a_M z^{-M}
B(z) = a_M z^{-1} + … + a_1 z^{-M} + z^{-(M+1)}    (4.2)

where a_1, a_2, …, a_M are the LPC coefficients, and A(z) is the transfer function of the
linear-prediction error filter. Further let us define the following two functions:

P(z) = A(z) − B(z)
Q(z) = A(z) + B(z)    (4.3)

Given P(z) and Q(z), we can calculate A(z), and hence the LPC coefficients, as follows:

A(z) = [P(z) + Q(z)] / 2    (4.4)

It was proven in [27] that P(z) and Q(z) can be factorized as follows:

If M is an even number,
P(z) = (1 − z^{-1}) ∏_{i=2,4,…,M} (1 − 2 z^{-1} cos ω_i + z^{-2})
Q(z) = (1 + z^{-1}) ∏_{i=1,3,…,M−1} (1 − 2 z^{-1} cos ω_i + z^{-2})

If M is an odd number,
P(z) = (1 − z^{-2}) ∏_{i=2,4,…,M−1} (1 − 2 z^{-1} cos ω_i + z^{-2})
Q(z) = ∏_{i=1,3,…,M} (1 − 2 z^{-1} cos ω_i + z^{-2})    (4.5)

where 0 < ω_1 < ω_2 < … < ω_M < π are the line spectral frequencies (LSFs). The
distribution of the LSF values reflects the shape of the spectral envelope: a dense
distribution of LSF values represents the high magnitude
portion of the spectrum; a scattered distribution represents the low magnitude portion; a close
pair of LSF values represents a peak in the spectrum.
LSF has some desirable properties: when the LSF values fall in the range (0, π), the
recovered LPC filter has guaranteed stability; local errors in LSF values only cause local
spectral distortion. Therefore, linear estimation based on LSF values is more tolerant to
estimation errors, as a single error cannot harm the whole spectral envelope. Linear-
estimation algorithms using LSFs were proposed in [6][21], and shown to yield superior
performance to codebook methods.
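Equations 4.2-4.5 give a direct recipe for rebuilding A(z) from a set of LSFs: multiply out the factored P(z) and Q(z) and average them per Equation 4.4. The Python sketch below does this for an even order M with illustrative frequencies; the assignment of odd- and even-indexed frequencies to Q and P follows one common sign convention for the factorization, so treat that detail as an assumption. A sanity check falls out of the algebra: the leading coefficient of A(z) is 1, and the z^{-(M+1)} terms of P and Q cancel.

```python
import math

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def lsf_to_lpc(lsf):
    """Rebuild LPC coefficients from ordered LSFs via P(z), Q(z) (even order M)."""
    P = [1.0, -1.0]   # the (1 - z^-1) factor of P(z)
    Q = [1.0, 1.0]    # the (1 + z^-1) factor of Q(z)
    for i, w in enumerate(lsf):
        factor = [1.0, -2.0 * math.cos(w), 1.0]   # (1 - 2 cos(w) z^-1 + z^-2)
        if i % 2 == 1:          # w_2, w_4, ... (1-based even indices) go to P
            P = poly_mul(P, factor)
        else:                   # w_1, w_3, ... go to Q
            Q = poly_mul(Q, factor)
    return [(p + q) / 2 for p, q in zip(P, Q)]   # Equation 4.4

lsf = [0.4, 0.9, 1.6, 2.3]                # ordered frequencies in (0, pi), M = 4
a = lsf_to_lpc(lsf)
print(round(a[0], 6), round(a[-1], 6))    # leading term is 1; top term cancels to 0
```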
In [6], the high-band LSFs are estimated from low-band LSFs by the following
equation:
f_h = f_l · A    (4.6)

where f_h and f_l are 1×8 vectors of the high-band LSFs and the low-band LSFs,
respectively. The 8×8 matrix A is calculated from training data by the following equation:

A = (F_l^T F_l)^{-1} F_l^T F_h    (4.7)

where the matrices F_l and F_h are obtained from training data; the rows of F_l and F_h
consist of samples of f_l and f_h, respectively. For each frame, the LSFs of the narrow-band
signal are divided by 2 and then fed into Equation 4.6 as f_l. The estimated f_h is combined
with f_l, and the whole set is used as the wide-band LSFs to generate the wide-band spectral
envelope. To improve the performance of the linear estimation, speech frames are divided into 4
groups based on the first two reflection coefficients, k1 and k2, as shown in Table 4.1.
Four matrices are trained and used separately for each group.
Four matrices are trained and used separately for each group.
Speech class    Reflection coefficients
Class 1         k1 <= -0.7, k2 <= 0.55
Class 2         k1 > -0.7,  k2 <= 0.55
Class 3         k1 <= -0.7, k2 > 0.55
Class 4         k1 > -0.7,  k2 > 0.55

Table 4.1. Classification of the [6] algorithm.
This algorithm makes the assumption that the frequency ranges (0, π/2) and (π/2, π)
have the same number of LSF values in wide-band speech. For example, when the LPC order
is 24, there are always 12 LSFs in (0, π/2) and 12 LSFs in (π/2, π). This assumption is not
true for all frames: the actual distribution is (11,13) or (13,11) with a probability of
around 50% in our training on the TIMIT database. In particular, the high-frequency consonants,
which are of special interest to our problem, mainly have the distribution (11,13). This
reflects the fact that their speech energy is concentrated in the high band. To compensate for
this drawback, the proposed algorithm artificially disperses the LSF values in f_l for
consonant frames before feeding them into the linear estimation. The classification component of
the proposed algorithm provides a group of fricative-consonant frames, which is used as the
criterion for applying this operation. The implementation can be found in Section 4.2.2.
Figure 4.1 shows the theoretical upper limit of the performance of the [6] algorithm, as the
output signal is synthesized using the original wide-band spectral envelope. The recovered
consonants are weak due to lack of energy in the narrow-band signal. When the original
speech is transmitted in a telephone line, consonants lose most of their energy in the high-
band. When we do the LPC analysis on a narrow-band consonant frame, this lack of energy
is reflected in the residual signal. The residual of a consonant frame was found to be about
20dB lower in magnitude than that of a vowel frame. Therefore, an amplifying operation is
needed for residual signals of consonant frames. The implementation can be found in Section
4.2.1.
Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence
spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum folding.
Another shortcoming is the classification by fixed thresholds. In real life, different
speakers have different distributions on the k1-k2 plane and therefore have different
thresholds. Ideal classification criteria should be adaptive to speakers by making use of the
information from other frames in the same sentence. In this thesis, we propose a
classification strategy based on Hidden Markov Models (HMM). This statistical method
makes thresholds soft, and therefore is more robust when working on speakers with various
voice characteristics. The derivation and implementation can be found in Section 4.2.3.
Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm.
4.2 Proposed Algorithm for Bandwidth Extension
This section describes the implementation details of the proposed algorithm. The overall
system flow is shown in Figure 4.2.
The narrow-band speech signal, with an 8kHz sampling rate, is processed on a frame-
by-frame basis using a 20-ms Hanning window and a 10-ms overlap between adjacent
frames. The windowed speech frame is analyzed using an LPC analyzer of order 12. The
output of the analyzer goes to three branches:
1. Residual extension. The narrow-band residual signal passes through a spectrum
folding function where the sample rate is doubled and the low-band spectrum is
copied to the high-band. Additional amplification is then applied to consonant
frames, and the resulting sequence is sent out as the wide-band residual.
2. Envelope extension. Narrow-band LPC coefficients are converted into 12
narrow-band LSF values. They are divided by 2, pre-processed and then fed into
the linear estimation, as shown in Equation 4.6. After eliminating invalid
numbers, the estimated 12 high-band LSF values, together with the low-band LSF
values, are converted back into order-24 wide-band LPC coefficients.
3. Classification of speech frames. Parameters are extracted from a narrow-band
speech frame. Using this information, a classification decision is made based on an
HMM, and this decision is used in both of the previous branches.
Finally, the wide-band residual passes through the LPC synthesizer (of order 24)
constructed by the wide-band LPC coefficients. The output is the desired wide-band speech,
sampled at the rate of 16kHz, with recovered high-band information.
Sections 4.2.1, 4.2.2 and 4.2.3 provide detailed descriptions of these three branches,
respectively.
Figure 4.3. Spectrum folding.
4.2.1 Residual Extension
The residual extension module implements the spectrum folding method proposed in [20].
The sample rate is doubled from 8kHz to 16kHz. The odd-index sample values of the wide-
band residual signal are copied from the narrow-band residual, and the even-index sample
values are zeroed. This process is shown in the following equation:
y(2k − 1) = x(k)
y(2k) = 0,    k = 1, 2, …, N    (4.8)
where y(n) is the wide-band residual and x(n) is the narrow-band residual. This time-domain
operation is equivalent to folding the (0, 4000) Hz spectrum to (4000, 8000) Hz in the frequency
domain, as illustrated in Figure 4.3.
Since the narrow-band residual has a flat spectrum, the spectrum is still flat after the
folding. For unvoiced frames, the narrow-band residual is a noise-like sequence, and the
wide-band residual inherits the same property; for voiced frames, the harmonic peaks in the
narrow-band spectrum are copied to the high-band. As has been discussed in Section 2.2.2,
the harmonic structure is disrupted at 4 kHz, but this drawback does not cause perceivable
artifacts in the output speech.
As discussed in Section 4.1, spectrum folding is not enough for consonant frames, as
their residual signal needs to be amplified to the energy level of a real wide-band consonant.
In order for the algorithm to be adaptive to different speakers, we use the average energy of
previous vowel residuals in the same sentence as the reference energy level. The
vowel/consonant decisions are provided by the classification branch. The amplification
process is shown in the following equation:
e(n) ← e(n) · [ ( (1/N) Σ_{i=1..N} Σ_n v_i(n)² ) / ( Σ_n e(n)² ) ]^{0.8}    (4.9)
where e(n) is the consonant residual sequence, v_i(n) is a vowel residual sequence, and N is
the number of previous vowel frames. The parameter 0.8 is an empirical number, which
represents the trade-off between the need to amplify the signal and to maintain the energy
8/3/2019 Haifeng Ms Thesis
50/78
40
variation across different consonants. It was found to be valid in experimental results, and
examples can be found in Section 4.3.3.
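The amplification step — scaling the consonant residual by a power of the ratio between the average energy of prior vowel residuals and the consonant residual's own energy — can be sketched as follows. Equation 4.9 is reconstructed from a garbled original, so the exact placement of the 0.8 exponent is an assumption; the function name and toy residuals are ours.

```python
def amplify_consonant(e, vowel_residuals, p=0.8):
    """Scale consonant residual e toward the average energy of prior vowel residuals."""
    e_energy = sum(s * s for s in e)
    v_energy = sum(sum(s * s for s in v) for v in vowel_residuals) / len(vowel_residuals)
    gain = (v_energy / e_energy) ** p     # empirical exponent, per Eq. 4.9
    return [gain * s for s in e]

vowels = [[1.0, -1.0, 1.0, -1.0], [2.0, 0.0, -2.0, 0.0]]   # toy vowel residuals
weak = [0.1, -0.1, 0.1, -0.1]              # a much weaker consonant residual
boosted = amplify_consonant(weak, vowels)  # gain > 1, so energy is raised
```

Using a running average over the sentence's previous vowel frames, rather than a fixed target, is what makes the reference level adapt to the speaker.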
Figure 4.4. Envelope extension. Markers indicate LSF values.
4.2.2 Envelope Extension
The envelope extension module implements a linear estimation method using LSF. The
estimation and training equations are the same as [6] algorithm, as shown in equations 4.6
and 4.7 respectively. Figure 4.4 shows an example of the spectral envelopes before and after
the extension, together with the distribution of LSF values. Please note that π on the x-axis of
the upper figure corresponds to 4 kHz, while π in the lower figure corresponds to 8 kHz.
Among the 24 LSF values describing the wide-band LPC spectral envelope, the first
12 LSFs come from the narrow-band signal and determine the envelope in the low-band.
They are calculated by dividing the narrow-band LSF values by 2, therefore their mutual
location remains unchanged and they still describe the same envelope. This fact can be seen
in Figure 4.4 where the first half of the lower envelope is a compressed copy of the upper
envelope. The second half of the 24 LSF values is generated by the linear estimation, and
determines how accurately we can estimate the high-band information. The estimation errors
cause spectral distortion in the output speech.
As discussed in Section 4.1, high-frequency consonant frames mainly have the LSF
distribution of (11,13). Therefore, when we train the estimation matrix for the consonant
group using Equation 4.7, most rows of the matrix F_l have a number exceeding π/2. But the
vector f_l we feed into the estimation is composed of numbers limited to (0, π/2), and this
mismatch between working data and training data substantially affects the performance. The
results include invalid output LSF values exceeding π and severe spectral distortion at the
highest frequencies.
The solution proposed in our algorithm is to artificially disperse the LSF values in f_l
for consonant frames, according to the following equation:
$$\text{If } f_l(12) < \pi/2:\quad f_l \leftarrow 1.5\, f_l^{\,1.0825} \qquad (4.10)$$
where 1.5 and 1.0825 are empirical parameters. This operation makes the maximum value
in f_l approach or exceed π/2, emulating the situation in a real wide-band frame. As a by-product,
some distortion is introduced into the low band: the magnitude is decreased and the
spectral peaks are moved towards higher frequencies. Fortunately, this distortion is negligible
for a consonant. Figure 4.5 compares the original wide-band envelope with the
envelopes estimated with and without this operation, each shown with the distribution of LSF values.
Figure 4.5. Effect of artificial dispersion.
The last step of the envelope extension module is eliminating invalid values.
By definition, LSF values must fall in the range (0, π). The possible errors in
this linear estimation are values over π. In this case, all the wide-band LSFs are scaled down
proportionally, as shown in the following equation:
$$\text{If } v(24) > 3.05:\quad v \leftarrow 3.05\, v / v(24) \qquad (4.11)$$
where v is the vector of LSF values. The reason for setting the upper limit to 3.05 instead of
π is that an LSF very close to π may cause whistling noise. The occurrence of invalid values
indicates an estimation failure, and the purpose of this last operation is to smooth such errors out. In
fact, when a good classification technique is used, these errors are very rare and are not
perceivable, thanks to the overlap between adjacent frames.
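The rescaling rule of Equation 4.11 can be sketched in Python (a minimal illustration; the function name and the assumption that the LSF vector is sorted are ours):

```python
import numpy as np

def enforce_valid_lsf(v, limit=3.05):
    """Scale down a wide-band LSF vector whose top value exceeds `limit`.

    LSFs must lie in (0, pi); the cap of 3.05, slightly below pi, avoids
    the whistling noise an LSF very close to pi can cause.  All 24 values
    are scaled proportionally, as in Equation 4.11.
    """
    v = np.asarray(v, dtype=float)
    if v[-1] > limit:              # v is sorted, so v[-1] is the maximum
        v = v * (limit / v[-1])
    return v
```
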
4.2.3 Classification of Speech Frames
From Sections 4.2.1 and 4.2.2, we have seen that classification decisions are used in training,
estimation-matrix choice, residual adjustment, and the pre-processing of linear estimation.
The accuracy of this classification of speech frames is critical to the performance of the
whole algorithm.
The proposed classification strategy in this thesis is based on a Hidden Markov
Model (HMM) [26]. Speech frames are divided into four classes: class 1 includes vowels and
semi-vowel consonants; class 2 includes nasal consonants; class 3 includes silence and weak
consonants (stops); class 4 includes fricative consonants. Class 4 is of special interest in
bandwidth extension. The definition of the four classes is given below in terms of the
phonemes of each class (TIMIT phonetic symbols are used):
Class 1={'aa' 'aw' 'ay' 'ah' 'ao' 'oy' 'ow' 'uh' 'l' 'r' 'w' 'el' 'iy' 'ih' 'eh' 'ey' 'ae' 'uw' 'ux'
'er' 'ax' 'ix' 'axr' 'ax-h' 'b' 'dx' 'q' 'v' 'dh' 'y'}
Class 2={'m' 'n' 'ng' 'em' 'en' 'eng' 'nx'}
Class 3={'g' 'p' 'k' 'hh' 'hv' 'pau' 'epi' 'h#' 'tcl' 'kcl' 'bcl' 'gcl' 'pcl' 'dcl'}
Class 4={'jh' 'ch' 'sh' 'zh' 'd' 't' 's' 'z' 'f' 'th'}
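For labeling training frames, the four class definitions above can be packed into a lookup table. A small Python sketch (ours, not from the thesis):

```python
# Phoneme-to-class lookup built from the four class definitions above
# (TIMIT phonetic symbols).
PHONEME_CLASS = {}
for cls, phonemes in {
    1: ['aa', 'aw', 'ay', 'ah', 'ao', 'oy', 'ow', 'uh', 'l', 'r', 'w',
        'el', 'iy', 'ih', 'eh', 'ey', 'ae', 'uw', 'ux', 'er', 'ax', 'ix',
        'axr', 'ax-h', 'b', 'dx', 'q', 'v', 'dh', 'y'],
    2: ['m', 'n', 'ng', 'em', 'en', 'eng', 'nx'],
    3: ['g', 'p', 'k', 'hh', 'hv', 'pau', 'epi', 'h#', 'tcl', 'kcl',
        'bcl', 'gcl', 'pcl', 'dcl'],
    4: ['jh', 'ch', 'sh', 'zh', 'd', 't', 's', 'z', 'f', 'th'],
}.items():
    for p in phonemes:
        PHONEME_CLASS[p] = cls
```

Each labeled frame from the TIMIT transcription files can then be mapped to its class with a single dictionary lookup.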
Since each speech frame belongs to one of the four classes, a sentence can be viewed
as a sequence of states, with each state indicating the current class. This state sequence is
hidden, and the purpose of our classification strategy is to find out this hidden sequence
based on information extracted from the speech signal.
Three parameters are chosen as the basis of classification. The first parameter is k1,
the first reflection coefficient of the speech frame. The second parameter is the zero-crossing
rate of the speech frame, defined as follows:
$$Z = \frac{1}{2N}\sum_{m=1}^{N}\left|\,\mathrm{sign}(x(m)) - \mathrm{sign}(x(m-1))\,\right| \qquad (4.12)$$

where sign(·) denotes the sign function and N is the frame length.
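Equation 4.12 can be sketched in Python (an illustrative implementation; treating the sign of zero as +1 is our assumption):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Zero-crossing rate Z of one speech frame (Equation 4.12)."""
    # sign(x): +1 for x >= 0, -1 otherwise (zero convention assumed).
    s = np.where(np.asarray(frame) >= 0, 1, -1)
    # Each sign change contributes |(+1) - (-1)| = 2 to the sum.
    return np.sum(np.abs(s[1:] - s[:-1])) / (2.0 * len(frame))
```

High-frequency consonants such as fricatives produce a high zero-crossing rate, which is why Z is useful for separating the four classes.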
where T is the number of frames in the sentence. Further considering
$$P[q_1, q_2, \ldots, q_T \mid O_1, O_2, \ldots, O_T] = \frac{P[q_1, \ldots, q_T, O_1, \ldots, O_T]}{P[O_1, O_2, \ldots, O_T]} \qquad (4.15)$$
where the denominator is a constant for a given sentence, an alternative expression of the
classification problem is given by the following equation.
$$\{q_1, q_2, \ldots, q_T\} = \underset{q_t \in \{1,2,3,4\}}{\arg\max}\; P[q_1, \ldots, q_T, O_1, \ldots, O_T] \qquad (4.16)$$
In order to evaluate the probability in Equation 4.16, we need to build a statistical
model describing the random processes {q_1, q_2, ..., q_T} and {O_1, O_2, ..., O_T}. The HMM
is specified by the following parameters [26]:
1. The initial state distribution:

$$\pi_i = P[q_1 = i], \quad i = 1, 2, 3, 4 \qquad (4.17)$$

2. The state transition probabilities:

$$a_{ij} = P[q_{t+1} = j \mid q_t = i], \quad i, j \in \{1, 2, 3, 4\} \qquad (4.18)$$

3. The observation distribution in each state (here a discrete distribution is used):

$$b_i(O) = P[O_t = O \mid q_t = i], \quad i = 1, 2, 3, 4 \qquad (4.19)$$
These parameters are calculated from training sentences in the TIMIT database, and
the phonetic description files in TIMIT are used to provide the real classification decisions in
training. More details about training will be given later.
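As a sketch of the supervised training step (our illustration, not the thesis code; the observation distributions b_i are trained analogously by counting quantized {k1, E, Z} triples per class), the initial and transition probabilities can be estimated by relative frequencies:

```python
import numpy as np

def train_hmm(state_sequences, n_states=4):
    """Estimate initial and transition probabilities by counting.

    Each sequence lists the true class (1..n_states) of every frame, as
    derived from the TIMIT phonetic transcription files.
    """
    pi = np.zeros(n_states)
    a = np.zeros((n_states, n_states))
    for seq in state_sequences:
        pi[seq[0] - 1] += 1                    # first frame of a sentence
        for s, t in zip(seq[:-1], seq[1:]):    # frame-to-frame transitions
            a[s - 1, t - 1] += 1
    pi = pi / pi.sum()
    rows = a.sum(axis=1, keepdims=True)
    a = a / np.maximum(rows, 1.0)              # avoid division by zero
    return pi, a
```
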
After the model is built, i.e., after the parameters are calculated, the solution of
Equation 4.16 can be computed by the Viterbi algorithm, which is computationally efficient.
First we define a quantity δ_t(i) as follows:

$$\delta_t(i) = \max_{q_1, \ldots, q_{t-1} \in \{1,2,3,4\}} P[q_1, \ldots, q_{t-1}, q_t = i, O_1, O_2, \ldots, O_t], \quad i = 1, 2, 3, 4 \qquad (4.20)$$
The Viterbi procedure to find the best state sequence is given below [26]:

1. Initialization:

$$\delta_1(i) = \pi_i\, b_i(O_1), \quad i = 1, 2, 3, 4 \qquad (4.21)$$
$$\psi_1(i) = 0, \quad i = 1, 2, 3, 4 \qquad (4.22)$$

2. Recursion (for t = 2, 3, ..., T):

$$\delta_t(j) = \max_{i = 1,2,3,4}\left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \quad j = 1, 2, 3, 4 \qquad (4.23)$$
$$\psi_t(j) = \underset{i = 1,2,3,4}{\arg\max}\left[\delta_{t-1}(i)\, a_{ij}\right], \quad j = 1, 2, 3, 4 \qquad (4.24)$$

3. Termination:

$$q_T = \underset{i = 1,2,3,4}{\arg\max}\; \delta_T(i) \qquad (4.25)$$

4. Backtracking:

$$q_t = \psi_{t+1}(q_{t+1}), \quad t = T-1, T-2, \ldots, 1 \qquad (4.26)$$
In practice, all the multiplications above are implemented as additions in the log
domain, because direct multiplication of a large number of small values can underflow the
numerical resolution of any computer. The computational complexity of this procedure is on the
order of 16T (T is the number of frames in the sentence), in addition to the extraction of {k1,
Z, E} for each frame. This complexity is much higher than that of the stationary classification
technique in Table 4.1, but is still feasible in real time. However, a real-time
implementation cannot make use of the whole sentence. In practice, we have found that
the performance of the Viterbi algorithm based on 30 frames is close to that based on a whole
sentence, and therefore a real-time implementation can use 30 frames as the decision size. The delay
caused by this algorithm will be 15 × 20 ms = 0.3 seconds, which is still acceptable.
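The log-domain Viterbi recursion of Equations 4.21-4.26 can be sketched in Python (an illustrative implementation, not the thesis code):

```python
import numpy as np

def viterbi(log_pi, log_a, log_b):
    """Most likely state sequence (Equations 4.21-4.26) in the log domain.

    log_pi: (S,)   log initial probabilities
    log_a:  (S, S) log transition probabilities
    log_b:  (T, S) log observation likelihoods log b_i(O_t) per frame
    Additions replace multiplications, avoiding numerical underflow.
    """
    T, S = log_b.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_b[0]                  # initialization (4.21)
    for t in range(1, T):                         # recursion (4.23, 4.24)
        scores = delta[t - 1][:, None] + log_a    # scores[i, j]
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + log_b[t]
    q = np.empty(T, dtype=int)
    q[-1] = np.argmax(delta[-1])                  # termination (4.25)
    for t in range(T - 2, -1, -1):                # backtracking (4.26)
        q[t] = psi[t + 1][q[t + 1]]
    return q
```

In a real-time setting, the same routine is simply run on successive blocks of 30 frames, as described above.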
The costs discussed in the previous paragraph are outweighed by the robustness
introduced by the new bandwidth-extension algorithm, which shows when dealing with
abnormal frames and abnormal speakers. A single frame with unusual parameters can still
be put into the correct group, because the decision for that frame takes nearby frames into
consideration. Likewise, classification techniques based on fixed thresholds suffer when used
for a speaker whose parameter distribution differs from the training data, whereas the
proposed classification technique has soft thresholds and is more tolerant of such abnormal input.
The observation distribution in each state is a three-dimensional discrete distribution,
defined as follows:
$$b_i(O) = P[O_t = O \mid q_t = i] = P[\{k_1, E, Z\} \mid q_t = i] = P[\{x, y, z\} \mid q_t = i], \quad i = 1, 2, 3, 4 \qquad (4.27)$$
where {x, y, z} is the quantization result of {k1, E, Z} (Equation 4.28). The trained
distributions are then smoothed over neighboring quantization levels:
$$b_i(x, y, z) \leftarrow \frac{b_i(x, y, z)}{2} + \frac{b_i(x-1, y, z) + b_i(x+1, y, z) + b_i(x, y-1, z) + b_i(x, y+1, z)}{12} + \frac{b_i(x, y, z-1) + b_i(x, y, z+1)}{12}$$
$$i = 1, 2, 3, 4; \quad x, y, z \in \{2, 3, \ldots, 34\} \qquad (4.29)$$
4.3 Evaluation and Results
Subjective and objective measures are commonly used for evaluating speech-processing
strategies. The ideal criterion for evaluating this bandwidth-extension algorithm is obviously a
subjective test, in which hearing-impaired people listen to the narrow-band speech and the
recovered wide-band speech and judge the quality. But the result of a subjective test can
be affected by many factors, such as the degree of hearing loss, the users'
experience in using telephones, and the choice of sentences. The results of a subjective test
often cannot be reproduced and are therefore not suitable for comparing different signal-processing
algorithms.
Objective measures, on the other hand, can be clearly defined and easily repeated.
Two algorithms can be compared fairly by calculating an objective measure on the same
speech database. However, the results of objective measures are often not highly correlated
with those of subjective measures. A signal judged to have the lowest distortion by an
objective test may not have the highest intelligibility or naturalness, nor be the one preferred
by human listeners. We still do not fully understand how human ears process speech signals
and therefore cannot define an optimal objective measure of speech quality.
Among current objective measures, the Itakura-Saito (IS) spectral distance and the Log
Likelihood Ratio (LLR) are widely used, as they have relatively modest correlation with
speech intelligibility [25]. We use both of them to evaluate the proposed bandwidth-extension
algorithm. Furthermore, a frequency-domain signal-to-noise ratio (SNR) is defined to
evaluate the envelope-extension performance, and an accuracy measure is defined to assess the
classification performance. The definitions and results of these four measures are provided in
Section 4.3.2. Sentence examples are presented in Section 4.3.3.
4.3.1 Test Material
The TIMIT database is a standard speech database widely used by speech-processing
researchers, and it is the source of all the training sentences and testing sentences in this
thesis [18]. The original sentences are all wide-band speech with a 16kHz sampling rate. The
TIMIT sentences were low-pass filtered and then down-sampled to generate the narrow-band
TIMIT sentences (4kHz bandwidth). Each objective measure is calculated for male-speaker
sentences, female-speaker sentences, and mixed sentences. In the male or female case, the
training data is 250 sentences, and the testing data is 10 sentences; in the mixed case, the
training data is 250 male sentences and 250 female sentences, and the testing data is 10 male
sentences and 10 female sentences. There is no overlap between the training data and the
testing data.
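The narrow-band simulation described above can be sketched with a windowed-sinc low-pass filter followed by decimation by 2 (an illustrative implementation; the filter length is our choice, not from the thesis):

```python
import numpy as np

def to_narrow_band(x, num_taps=101):
    """Simulate telephone-bandwidth speech from a 16 kHz TIMIT signal:
    low-pass at 4 kHz, then keep every second sample (8 kHz output).
    """
    # Windowed-sinc low-pass with normalized cutoff 0.25 cycles/sample
    # (4 kHz at a 16 kHz sampling rate), Hamming-windowed.
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    h /= h.sum()                       # unity gain at DC
    y = np.convolve(x, h, mode='same')
    return y[::2]                      # decimate by 2 -> 8 kHz
```
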
4.3.2 Objective Measures
The four objective measures are discussed below followed by the test results.
Itakura-Saito (IS) distance measure

IS and LLR are spectral distance measures based on the all-pole LPC model. They
compare the original signal and the distorted signal on a frame-by-frame basis. The IS
distance is the most widely used measure of spectral distortion, as it takes into
consideration both the spectral envelope and the frame energy. The IS distance of one
frame is computed as follows:
$$d_{IS} = \frac{a_y^T R_x a_y}{a_y^T R_y a_y} + \log_{10}\!\left(\frac{a_y^T R_y a_y}{a_x^T R_x a_x}\right) - 1 \qquad (4.30)$$
where a_x and R_x are the linear prediction coefficient vector and the autocorrelation
matrix of the original speech frame, respectively; a_y and R_y are those of the estimated
speech frame to be evaluated.
Since the IS measure changes when the energy of the target sentence changes, the
original wide-band sentence and the estimated wide-band sentence are first normalized to
have the same total energy. Then IS distance values are calculated for each pair of 20ms
frames taken from the two signals. The results are averaged over all the frames, with the
highest 10% of the IS distance values discarded to smooth out meaningless large
numbers. The average IS distance is calculated as follows:
$$\bar{d}_{IS} = \frac{1}{0.9N}\sum_{\text{lower } 90\%} d_{IS} \qquad (4.31)$$
where N is the number of frames.
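For illustration, the frame-level IS computation and the trimmed averaging of Equation 4.31 can be sketched in Python (our sketch, not the thesis code; the LPC order of 12 is an assumption):

```python
import numpy as np

def autocorr_matrix(frame, order):
    """(order+1) x (order+1) Toeplitz autocorrelation matrix R."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order + 1),
                                   np.arange(order + 1)))
    return r[idx]

def lpc(frame, order):
    """LPC inverse-filter coefficients [1, -a1, ..., -ap]
    (autocorrelation method, via the normal equations)."""
    R = autocorr_matrix(frame, order)
    a = np.linalg.solve(R[1:, 1:], R[1:, 0])
    return np.concatenate(([1.0], -a))

def is_distance(x_frame, y_frame, order=12):
    """Itakura-Saito distance of one frame pair (Equation 4.30);
    base-10 log, as in the thesis."""
    ax, Rx = lpc(x_frame, order), autocorr_matrix(x_frame, order)
    ay, Ry = lpc(y_frame, order), autocorr_matrix(y_frame, order)
    ex = ax @ Rx @ ax                  # prediction-error energies
    ey = ay @ Ry @ ay
    return (ay @ Rx @ ay) / ey + np.log10(ey / ex) - 1.0

def trimmed_mean(d, keep=0.9):
    """Average over the lowest 90% of frame distances (Equation 4.31)."""
    d = np.sort(np.asarray(d))
    return d[:int(round(keep * len(d)))].mean()
```
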
Figure 4.6 shows the performance comparison between the [6] algorithm and the
proposed algorithm. A denotes the [6] algorithm; C denotes the proposed algorithm
in this thesis, with the HMM classification strategy; B denotes the proposed algorithm
without the HMM part, using instead the adjusted classification thresholds given in Table 4.2. In
all cases, the average IS distance is calculated over all testing sentences. The two
versions of the proposed algorithm have consistently lower speech-distortion values,
which corresponds to higher speech quality.
Figure 4.6. IS distance comparison of the algorithms.
Speech class    Reflection coefficients
Class 1         k1 ≤ −0.7, k2 ≤ 0.55
Class 2         k1 ≤ 0.55
Class 4         k1 ≥ −0.3, k2 ≤ 0.55

Table 4.2. Adjusted thresholds used in B.
Log Likelihood Ratio (LLR)

As the origin of the IS measure, LLR involves only the distortion of the LPC spectral
envelope and has a much clearer physical interpretation. The LLR measure of one speech frame is
computed as follows:
$$d_{LLR} = \log_{10}\!\left(\frac{a_y^T R_x a_y}{a_x^T R_x a_x}\right) \qquad (4.32)$$
where a_x and R_x are the linear prediction coefficient vector and the autocorrelation
matrix of the original speech frame, respectively, and a_y is the linear prediction coefficient
vector of the estimated speech frame to be evaluated. In the time domain, the
denominator can be viewed as the optimal prediction-error energy, and the numerator as
the prediction-error energy obtained using the estimated LPC coefficients. From
Wiener filtering theory, the denominator is the minimum possible error energy and is
achieved only with the true LPC coefficients. Therefore, the numerator is
always larger than the denominator, and the LLR value is always non-negative. The
larger the LLR value, the more the estimated LPC coefficients differ from the
true ones, in the error-energy sense. In the frequency domain, the LLR can be
reformulated as follows [25]:
$$d_{LLR} = \log_{10}\!\left\{ 1 + \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{\left|A_x(e^{j\omega}) - A_y(e^{j\omega})\right|^2}{\left|A_x(e^{j\omega})\right|^2}\, d\omega \right\} \qquad (4.33)$$
where A_x(e^{jω}) and A_y(e^{jω}) are the LPC spectra of the original speech frame and the
generated speech frame, respectively. Equation 4.33 can be viewed as a weighted sum of
the spectral-envelope distortion over all frequencies, with high weighting placed at the formant
frequencies of the original signal. Therefore, LLR mainly models the mismatch between
the formants of the two speech frames.
LLR values are calculated for each pair of 20ms frames taken from the two signals.
The results are averaged over all the frames, with the highest 10% of the LLR values
discarded to smooth out meaningless large numbers. The average LLR is calculated as
follows:
$$\bar{d}_{LLR} = \frac{1}{0.9N}\sum_{\text{lower } 90\%} d_{LLR} \qquad (4.34)$$
where N is the number of frames.
Figure 4.7. LLR measure comparison of the algorithms.
Figure 4.7 gives the bar plots of the average LLR values of the [6] algorithm and the
proposed algorithm for male, female, and mixed cases. The proposed algorithm shows a
consistently lower speech distortion.
Frequency-domain signal-to-noise ratio (SNR)

The frame-based segmental SNR is another popular method for evaluating speech
quality. It is defined as the ratio of the signal energy to the noise energy, in decibels.
Because the phase information is lost when the bandwidth-extension algorithm processes
the residual signal, the time-domain SNR is not suitable for measuring the performance.
As an alternative, we propose a frequency-domain SNR (denoted SNRf), defined
in the following equation:
$$SNRf = 10 \log_{10} \frac{\displaystyle\int_{-\pi}^{\pi} \left|A_x(e^{j\omega})\right|^2 d\omega}{\displaystyle\int_{-\pi}^{\pi} \left|A_x(e^{j\omega}) - A_y(e^{j\omega})\right|^2 d\omega} \qquad (4.35)$$
where A_x(e^{jω}) and A_y(e^{jω}) are the LPC spectra of the original speech frame and the
estimated speech frame, respectively. The frequency-domain SNR represents the distortion of
the LPC spectral envelope. The difference between this measure and LLR is that the
SNRf measure treats the distortion at all frequency components equally.
SNRf values are calculated for each pair of 20ms frames taken from the two signals.
The results are averaged over all the frames, and the average SNRf is calculated as
follows:
$$\overline{SNRf} = \frac{1}{N}\sum_{n=1}^{N} SNRf(n) \qquad (4.36)$$
where N is the number of frames.
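Equation 4.35 can be approximated on an FFT grid. A Python sketch (ours, not the thesis code; the FFT size is an arbitrary choice):

```python
import numpy as np

def snr_f(ax, ay, n_fft=512):
    """Frequency-domain SNR of one frame pair (Equation 4.35).

    ax, ay are the LPC inverse-filter coefficient vectors of the original
    and estimated frames; the integrals over the LPC spectra A(e^{jw})
    are approximated by sums over an FFT frequency grid.
    """
    Ax = np.fft.rfft(ax, n_fft)        # A_x(e^{jw}) on the grid
    Ay = np.fft.rfft(ay, n_fft)        # A_y(e^{jw}) on the grid
    num = np.sum(np.abs(Ax) ** 2)
    den = np.sum(np.abs(Ax - Ay) ** 2)
    return 10.0 * np.log10(num / den)
```

A closer spectral-envelope estimate yields a smaller denominator and hence a higher SNRf, matching the bar-chart comparison in Figure 4.8.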
Figure 4.8. Frequency-domain SNR comparison of the algorithms.
Figure 4.8 shows the bar chart comparing the frequency-domain SNR values of the
estimated wide-band speech produced by the two algorithms. The proposed algorithm
shows a 2dB SNRf gain over the [6] algorithm, which is attributed to the improvement in the
envelope-extension branch of the algorithm.
Classification accuracy measure

As discussed in Section 4.2, the accuracy of the classification decisions is
critical to the performance of the whole algorithm, and the accuracy of identifying
fricative-consonant frames is particularly important. Therefore, when calculating this
measure, we divide speech frames into only 2 groups: fricative-consonant frames and
other frames. The accuracy is defined as t