
    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    APPROVED BY SUPERVISORY COMMITTEE:

    Dr. Philipos Loizou, Chair

    Dr. Andrea Fumagalli

    Dr. Murat Torlak

    Copyright 2002

    Haifeng Qian

    All Rights Reserved

    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    by

    HAIFENG QIAN

    THESIS

    Presented to the Faculty of

    The University of Texas at Dallas

    in Partial Fulfillment

    of the Requirements

    for the Degree of

    MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

    THE UNIVERSITY OF TEXAS AT DALLAS

    May 2002

    ACKNOWLEDGEMENTS

    I would like to thank my adviser, Dr. Philipos Loizou, for his guidance in my research. He

    has offered me many helpful suggestions throughout my two-year graduate study.

    I would also like to thank Dr. Andrea Fumagalli and Dr. Murat Torlak, for their valuable

    feedback on this manuscript.

    Thanks also go to my coworkers in the Speech Processing Lab, for their cooperation and

    friendship. It has been my pleasure to work with them. Dr. Oguz Poroy, a former member of

    the lab, helped me with the hardware building in this thesis.

    I would also like to take this opportunity to thank National Institutes of Health for supporting

    the research under Grant R01 DC03421.

    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    Haifeng Qian, M.S.E.E.

    The University of Texas at Dallas, 2002

    Supervising Professor: Dr. Philipos C. Loizou

    This thesis addresses the problem of helping hearing-impaired people to use telephones.

    There are two aspects of this work: a Bluetooth-based wireless phone adapter and a

bandwidth-extension algorithm. Built upon Bluetooth technology, the proposed phone adapter routes the telephone audio signal to the hearing aid or the CI processor wirelessly, and hence eliminates environmental noise and interference. The proposed bandwidth-extension algorithm has the potential to increase speech intelligibility for hearing-impaired people by estimating a wide-band signal from the narrow-band telephone signal. It combines piecewise linear estimation based on line spectral frequencies with a statistical speech-frame classification technique based on hidden Markov models, which overcomes a drawback of conventional bandwidth-extension algorithms. The phone adapter was tested by

    CI users, and the proposed algorithm was evaluated by objective measures. Both results

    showed good performance.

    TABLE OF CONTENTS

    ACKNOWLEDGEMENTS..................................................................................................... iv

    ABSTRACT.............................................................................................................................. v

    LIST OF FIGURES................................................................................................................viii

    LIST OF TABLES.................................................................................................................... x

    1. INTRODUCTION................................................................................................................ 1

    2. LITERATURE REVIEW..................................................................................................... 3

    2.1 Assistive listening devices ............................................................................................ 3

    2.1.1 Hardwired devices................................................................................................ 5

    2.1.2 Induction loop devices ......................................................................................... 6

    2.1.3 Frequency modulation devices............................................................................. 6

2.1.4 Infrared light devices ......................................................................... 7

    2.2 Telephone recognition by CI users ............................................................................... 8

    2.3 Speech enhancement by bandwidth extension............................................................ 11

    2.3.1 Fundamentals of bandwidth extension............................................................... 12

    2.3.2 Residual extension ............................................................................................. 13

    2.3.3 Codebook method .............................................................................................. 15

    2.3.4 Linear estimation method................................................................................... 17

    3. BLUETOOTH-BASED PHONE ADAPTER.................................................................... 20

    3.1 Introduction to Bluetooth ............................................................................................ 20

3.2 Phone adapter design................................................................... 23

3.3 Hardware design.......................................................................... 24

    3.4 Software design for wireless link................................................................................ 26

    3.5 Testing......................................................................................................................... 29

    4. BANDWIDTH EXTENSION OF TELEPHONE SPEECH.............................................. 31

    4.1 Linear estimation method for bandwidth extension.................................................... 31

    4.2 Proposed algorithm for bandwidth extension ............................................................. 37

    4.2.1 Residual extension ............................................................................................. 38

    4.2.2 Envelope extension ............................................................................................ 40

    4.2.3 Classification of speech frames.......................................................................... 43

    4.3 Evaluation and results ................................................................................................. 48

    4.3.1 Test material....................................................................................................... 49

    4.3.2 Objective measures ............................................................................................ 49

    4.3.3 Examples............................................................................................................ 57

    5. CONCLUSIONS................................................................................................................ 60

    BIBLIOGRAPHY................................................................................................................... 63

    VITA

    LIST OF FIGURES

    Figure 2.1. Three components of an ALD ................................................................................ 4

    Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech. ..... 11

    Figure 2.3. Architecture of bandwidth extension systems...................................................... 12

    Figure 2.4. Residual extension by nonlinear distortion........................................................... 14

    Figure 2.5. Envelope extension by codebook mapping. ......................................................... 16

    Figure 3.1. Structure of a Bluetooth stack............................................................................... 22

    Figure 3.2. Architecture of the phone adapter......................................................................... 24

    Figure 3.3. Hardware design. .................................................................................................. 25

    Figure 3.4. Phone adapter prototype. ...................................................................................... 26

    Figure 3.5. Emulating the Headset Profile.............................................................................. 27

    Figure 3.6. Message flow. ....................................................................................................... 28

    Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence

    spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum

    folding. .............................................................................................................................. 35

    Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm. ................. 36

    Figure 4.3. Spectrum folding. ................................................................................................. 38

    Figure 4.4. Envelope extension. Markers indicate LSF values............................................... 40

    Figure 4.5. Effect of artificial dispersion. ............................................................................... 42

    Figure 4.6. IS distance comparison of the algorithms............................................................. 51

    Figure 4.7. LLR measure comparison of the algorithms. ....................................................... 53

    Figure 4.8. Frequency-domain SNR comparison of the algorithms. ...................................... 55

    Figure 4.9. Classification performance. .................................................................................. 56

    Figure 4.10. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated

    wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed

    algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with

    HMM................................................................................................................................. 58

    Figure 4.11. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated

    wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed

    algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with

    HMM................................................................................................................................. 59

    LIST OF TABLES

    Table 4.1.--Classification of [6] algorithm. ............................................................................ 34

    Table 4.2.--Adjusted thresholds used in B. ......................................................................... 51

    CHAPTER ONE

    INTRODUCTION

    Hearing-impaired people, including hearing aid users and cochlear implant (CI) users, often

    have difficulty talking through telephones. The intelligibility of telephone speech is

    considerably lower than the intelligibility of person-to-person speech. This degradation

    results mainly from the following three factors:

    1. Lack of visual cues. In a person-to-person conversation, a hearing aid or CI user

often uses lip-reading or other visible cues to help understand the other person.

    When talking on the phone, the audio signal is the only information source that he

    can make use of.

    2. Loss of high-frequency information. Telephone speech is band-limited to 300Hz-

    3400Hz. The spectrum above 3.4kHz, present primarily in fricative consonants

such as 's', 'sh', and 'ts', is lost. This results in the muffled quality of telephone sound, which does not affect normal-hearing people but greatly affects the

    hearing impaired.

    3. Additional noise introduced by the interaction between the phone and the hearing

aid. The electromagnetic coupling between the phone-handset circuit and the hearing aid coil results in feedback and amplification of background noise

    [33]. A cellphone's electromagnetic emission is often picked up by a hearing aid

    as a buzzing noise [3]. Also, the performance of a CI decreases when using

    cellphones, and different CI processors are not compatible with certain kinds of

    cellphones [28].

    To address the third problem, phone adapters have been proposed to help the hearing

    impaired in telephone conversation. They are one category of assistive listening devices that

    route the audio signal to the hearing aid or CI and hence maximize the signal-to-noise ratio

    (SNR) [32][34]. In this thesis, we propose a wireless phone adapter based on Bluetooth

technology, a recently introduced short-range wireless transmission standard.

    To address the second problem, algorithms can be designed to process telephone

speech to improve intelligibility [6][14][33]. Effort has been put into bandwidth extension

    techniques that aim at recovering the lost consonants from the narrow-band telephone

    speech. In this thesis, we propose a linear estimation method based on line spectral

    frequencies (LSF).

    This thesis is organized as follows: Chapter 2 is a review on current assistive listening

    devices and bandwidth extension methods that have been developed; Chapter 3 proposes a

    Bluetooth-based phone adapter to address the third problem; Chapter 4 proposes a bandwidth

    extension algorithm to solve the second problem; Chapter 5 presents conclusions and future

    work.

    CHAPTER TWO

    LITERATURE REVIEW

    In this chapter, we provide a literature review on assistive listening devices (ALD), telephone

    recognition by CI users, and speech bandwidth extension algorithms.

ALDs are a general category of devices used to help hearing-impaired people in

    different applications. They have been developed using all kinds of modern technologies [3]

    [19]. Phone adapters are a special group of ALDs that send the telephone audio directly to

    the hearing aid or CI in order to minimize exposure to environmental noise.

    Telephone usage is one of the major concerns of CI users. Studies [8][28] have shown

    that a certain percentage of CI users do not feel comfortable using the telephone. Their

    conversation quality is limited by various problems [16], and improvements are needed.

    Bandwidth extension algorithms try to improve the general intelligibility by doubling

    the sample rate of telephone speech to recover the lost high-frequency information.

    Sections 2.1, 2.2 and 2.3 below provide literature reviews on ALDs, CI telephone

    comprehension, and bandwidth extension respectively.

    2.1 Assistive Listening Devices

    Assistive listening devices (ALD) aim at improving the quality of life of hearing-impaired

    people. With the help of hearing aids or cochlear implants, hearing-impaired people are

    usually able to have a person-to-person communication in a quiet environment. However,

    when there is ambient noise or interference, the hearing-impaired people suffer much more

    degradation than normal-hearing people. Therefore, they often need ALDs designed to pick

    up audio signals from the desired source and minimize the undesired interferences.

    To ensure the rights of hearing-impaired people, auxiliary services are required

    according to the Americans with Disabilities Act (Public Law 336 of the 101st Congress),

    which was enacted on July 26, 1990. Public services, operated by government or private

    entities, must provide hearing-impaired people the service that is functionally equivalent to

    that of normal-hearing people [29]. The auxiliary services include "qualified interpreters,

    assistive listening devices, notetakers, and written materials".

    Figure 2.1. Three components of an ALD.

An assistive listening device is usually composed of three parts: a sound-pickup

    component, a sound-generating component, and a transfer component that connects the

    previous two. The sound-pickup component, most commonly a microphone, picks up audio

    signal from a person, a TV, a stereo, or a telephone. The signal is routed to the ALD user by

    hardwire or wireless technology, then is processed, amplified, and sent to the hearing aid or

    the processor of the cochlear implant user.

    Different applications have different needs and require different designs. No single

    solution is optimal for all scenarios. Based on the method of sending the audio signal from

    the sound-pickup component to the sound-generating component, assistive listening devices

    fall into two categories: hardwired devices and wireless devices. The wireless devices can be

    further classified into three categories: induction loop, frequency modulated (FM), and

    infrared light, named after the transmission technologies used [3]. The different types of

    ALDs are discussed in the following sections. More detail is provided for the wireless

    devices.

    2.1.1 Hardwired Devices

    The obvious advantage of hardwired ALDs is that the transfer of sound by a cord is free of

electronic interference. However, for the same reason, they have the disadvantage of limited mobility. For a personal ALD, the user is confined to within a few meters of the sound source;

    for a large assistive listening system installed in an auditorium, users are restricted to specific

    seats.

    A typical example of hardwired ALDs is a currently available phone adapter for

    hearing-impaired people. The adapter plugs in between the phone-base and the phone-

    handset, takes the speech signal out, and provides an audio output jack that can be connected

    to the hearing aid or the CI processor. The user can listen through the adapter while still

talking into the phone-handset [32]. It avoids the sound degradation caused by the phone-

    handset speaker and the environmental noise, and therefore provides the user better

    conversation quality. This adapter may not work with all phones; since it is hardwired, the

    cord length confines the user.

Another recently proposed ALD for CI users can also be classified as a hardwired device.

    An in-the-ear microphone is connected to the CI input. The user only needs to hold the

    phone-handset, as normal, and the in-the-ear microphone picks up the sound [34]. It is small

and convenient, compatible with all phones, and the environmental noise is partially

    blocked as the phone-handset itself can serve as a seal.

    2.1.2 Induction Loop Devices

    In induction loop ALDs, audio signal received from the desired source is amplified and then

    sent to a wire loop that surrounds the room. The alternating current, carrying the signal,

    generates an alternating magnetic field inside the room. A coil on the user side picks up this

    magnetic field, and an inductive current is generated inside the coil, carrying the desired

    audio signal. The coil can be the input to the hearing aids or CI processors.

    The advantage of induction loop ALDs is their simple installation. For hearing aids

with a telecoil switch, the ALD requires nothing from the user, who simply walks into the room and switches to the "telecoil" option to listen through the ALD. However,

    induction loop devices are vulnerable to electromagnetic interference. Electrical installations

    and wires in the room, or another induction loop ALD nearby, are all possible sources of

    interference. For the above reasons, induction loop ALDs are typically used in large public

    facilities, such as classrooms and auditoriums.

    2.1.3 Frequency Modulation (FM) Devices

    FM ALDs use frequency modulation technology as the transmission method. The frequency

    variation around the carrier frequency represents the audio information. The user uses a

receiver to demodulate the radio frequency signal and retrieve the audio signal, which is then

    sent to a hearing aid or a CI processor.

    The advantages of FM ALDs are their portability, large coverage, and the ability to

    broadcast several signals in different channels at the same time. On the other hand, they are

    more complicated and expensive than induction loop ALDs; they are subject to interference

from radio signals, which may come from radio broadcasts or another FM ALD nearby; there is

    also a lack of privacy. FM ALDs are widely used as both personal ALDs and large assistive

    listening systems.

    2.1.4 Infrared Light Devices

    Infrared light technology is similar to FM except that the signal carrier, infrared light, is

    directional and it cannot penetrate opaque objects (such as walls).

This property brings the obvious advantage of privacy, as the signal is confined to the room. Also, because an infrared light ALD does not suffer interference from adjacent rooms and is resistant to radio interference, it provides higher audio quality than

    FM devices. However, infrared light ALDs are the most complicated and expensive of the

three kinds; their high power consumption usually cannot be supported by batteries, so they cannot be portable; and the receiver has to avoid sunlight, as the infrared component of sunlight can severely interfere with the desired signal. Due to these properties of infrared light,

    these ALDs are mostly used for home applications.

    In Chapter 3, a Bluetooth-based phone adapter is proposed. It belongs to the group of

FM ALDs. By taking advantage of the new Bluetooth technology, it overcomes the

    pitfalls of traditional FM ALDs and offers better sound quality and resistance to

    environmental noise.

    2.2 Telephone Recognition by CI Listeners

An important indicator of the quality of life of a CI user is whether he is able to carry on a conversation over the telephone, in the absence of lip-reading cues. Telephone usage is part

    of many CI rehabilitation programs [31] [2]. According to the survey result in [8], 51% of

    Ineraid CI implantees initiate calls and 66% of them answer calls in daily life. In another

    recent questionnaire result reported in [28], 51 out of 61 Finnish respondents use telephones.

    However, telephone competence shown in these studies was mostly limited to familiar callers

    and familiar topics.

    Different tests have been designed to evaluate the telephone ability of CI users. In [4]

    (1985), one of the earliest studies on this topic, one CI implantee with high performance was

    chosen to listen to Central Institute for the Deaf (CID) sentences over telephone and to repeat

them. She identified 21% of the keywords correctly, and 47% when listening twice. In a more

    systematic study reported in [7] (1989), subjects were tested with sentences sent through an

    extension-telephone call, a local call and a long-distance call. The results showed that 23% of

    their patients had a significant degree of telephone ability, and that a 50% or higher score in

the CID sentence test was a good indicator of telephone competence. The second conclusion

    was also confirmed by [8]. Another more recent study, [17] (1998), reported that 68% of the

    adult Clarion CI users were able to understand at least half of the sentences over the phone,

    and half were able to understand at least 75% of the sentences, 12 months post implantation.

Tests designed for prelingually deaf children with CIs are different from the above tests

    for postlingually deaf adults, as these children have no telephone experience and have a

    limited vocabulary. In [2], six children were tested with monosyllables, 2-syllable words and

    3-syllable words presented through telephone. The average percentage of correct responses

    ranged from 50% to 83% for different materials, and some of the children began to use

    telephones after this training program. A larger-scale study was reported in [31], which tested

    150 prelingually deaf children ranging from 1 year to 5 years after implantation. This

    hierarchical test started from recognizing rings and went up to carrying open conversation

    with unfamiliar callers. The performance of the children increased significantly over time

    and approached the level of normal-hearing children after 5 years.

    Although the above results are encouraging, most CI users are not able to have an

    interactive conversation with unfamiliar callers about unfamiliar topics, and they describe the

    telephone speech quality as weak, hollow, tinny, having echo, fuzzy, or distorted in other

    ways [8]. A detailed survey about telephone problems was done in [16]. It collected

information mainly from hearing-aid users, but the results were also applicable to CI users.

    Background noise was a problem for 94% of the respondents; 76% of them thought

    telephone speech was too soft; 66% of them reported lack of clarity, and this could not be

    solved by amplification. 70% of the subjects found coupling a hearing aid with a telephone to

    be problematic due to feedback effects, and nearly half of them preferred not to use their

    hearing aids with telephones. The respondents also showed a strong desire for improvements

    of ALDs.

In [28], the compatibility between CIs and cellular phones was explored. Digital

    phones generate a broad-spectrum radio signal, which appears to CI processors as noise. The

test results showed that Nucleus CI systems are not compatible with GSM phones, while

    Combi 40+ systems are compatible with the GSM phones tested.

    There are currently several phone adapters and ALDs for CI users. In [19], three

    commercial FM ALD products were evaluated with CI users in a noisy environment. All

    subjects demonstrated much higher recognition performance with the help of FM ALDs.

    Two widely used phone adapters are TEL-001 (Williams Sound) and TLP-102 (DynaMetric).

    Both of them are hardwired ALDs. They plug into the handset jack of a normal telephone,

    and provide a direct access to the telephone audio in the form of a mono-plug, which can be

fed into a CI processor. Thus environmental noise is excluded, and the feedback problem

    is avoided. A detailed description of TEL-001, as well as an in-the-ear microphone solution

    proposed in [34], can be found in Section 2.1.1.

    Speech intelligibility can also be improved for hearing-impaired people by using

    signal-processing techniques. Special strategies can be designed to compensate for their

    hearing loss. In [33], a frequency shaping method was proposed. It amplified the audio signal

    frequency-selectively based on the knowledge of hearing loss at different frequencies. When

    evaluated in an intelligibility test, the algorithm combining frequency shaping and frequency-

    selective amplitude compression achieved the most speech enhancement: 15-30% increase in

    recognition.

    Different from the above user-dependent signal processing strategies, bandwidth

    extension algorithms improve the general intelligibility by recovering lost high-frequency

    information. An overview of bandwidth extension algorithms is given next.

    2.3 Speech Enhancement by Bandwidth Extension

    The telephone speech signal in current telecommunication networks is band-limited to 300-

    3400Hz, while the speech bandwidth spans the range of 50Hz to 8000Hz. Figure 2.2 shows

    the spectrograms of the narrow-band signal and the wide-band signal. The loss of

information in the [50, 300] Hz and [3400, 8000] Hz ranges causes a muffled effect. For normal-

    hearing people, the narrow-band telephone signal is already good enough for intelligibility,

    and they prefer a wide-band signal only because it sounds more natural. For the hearing

    impaired, the loss of high-frequency consonants is one of the main reasons for the difficulty

    in using telephones.

    Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech.

Due to the redundant nature of human speech, the lost information can be

    recovered, at least partially, from the narrow-band signal. Algorithms, such as [6] [9] [14],

    have been proposed to solve this problem. Such algorithms can be implemented at the user

    end, and therefore require no change for the telephone network. Also, in [35], a coding

    method is proposed to recover wide-band speech accurately at the expense of additional low-

    bitrate transmission of side-information.

    2.3.1 Fundamentals of Bandwidth Extension

The low-band part of speech, 50-300Hz, contributes mainly to speech quality and little to

    intelligibility. In [35], the low-band signal is represented by two sinusoids. In [21], this is

    done by spectral envelope extension and inserting sinusoids into the residual. The

    performance of both methods depends on the accuracy of pitch detection, which is sometimes

    unreliable.

    In this thesis, we mainly focus on recovering the high-frequency band, i.e. 3400-

    8000Hz, of telephone speech. The typical architecture of a bandwidth extension system is

shown in Figure 2.3.

    Figure 2.3. Architecture of bandwidth extension systems.

    The whole algorithm can be viewed as two separate processes: the residual extension

and the spectral envelope extension. An LPC (Linear Predictive Coding) analyzer extracts

    the spectral envelope from the input narrow-band signal. The residual extension module

    processes the resulting residual signal, while the envelope extension module predicts the

    wide-band spectral envelope, based on the 300-3400Hz portion. The desired signal is then

    synthesized by using the wide-band residual and the wide-band LPC coefficients.

    Residual extension is discussed in Section 2.3.2. Two main methods of envelope

    extension, codebook mapping and linear estimation, are discussed in Section 2.3.3 and 2.3.4,

    respectively.

    2.3.2 Residual Extension

    A short frame of speech signal can be modeled as an autoregressive (AR) random process.

    The residual signal is the linear prediction error sequence, defined by the following equation:

$$e(n) = s(n) - \sum_{k=1}^{M} a_k\, s(n-k) \qquad (2.1)$$

where $a_1, a_2, \ldots, a_M$ are the LPC coefficients, $e(n)$ is the residual signal, and $s(n)$ is the speech signal.
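As a minimal illustration of Equation 2.1 (a sketch only, assuming numpy and scipy are available and that the LPC coefficients for the frame have already been estimated), the residual is obtained by filtering the speech with the FIR prediction-error filter:

    import numpy as np
    from scipy.signal import lfilter

    def lpc_residual(s: np.ndarray, a: np.ndarray) -> np.ndarray:
        """Prediction error of Eq. 2.1: e(n) = s(n) - sum_k a_k s(n-k).

        s: one frame of speech samples.
        a: LPC coefficients [a_1, ..., a_M], assumed precomputed.
        """
        # The prediction-error filter is FIR with taps [1, -a_1, ..., -a_M].
        return lfilter(np.concatenate(([1.0], -a)), [1.0], s)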

    The residual signal has a flat spectrum like white noise. In a voiced frame, such as

    vowels and semi-vowel consonants, the residual noise has periodicity. This appears as

harmonic peaks in addition to the flat noise-like spectrum. These peaks occur at multiples of the pitch, the fundamental voice frequency of the speaker.

    Therefore, the task of the residual extension module is to double the sampling rate,

    from 8kHz to 16kHz, while keeping the whole spectrum flat. If there are harmonics in the

    narrow-band residual, the wide-band residual should also have the harmonic structure. There

    are two methods in common use that accomplish that:

    1. Nonlinear distortion method. As is shown in Figure 2.4, the narrow-band residual is

    first upsampled by interpolation and then fed into a nonlinear function. The distorted

    signal will have the desired bandwidth and harmonic structure over the whole

    spectrum. After the whitening filter, the spectrum is flattened and the wide-band

    residual is achieved. A popular nonlinear function is given below:

$$y(t) = \frac{(1-\alpha)\, x(t) + (1+\alpha)\, \lvert x(t) \rvert}{2} \qquad (2.2)$$

where $x(t)$ is the input signal, $y(t)$ is the distorted output signal, and $\alpha$ is a parameter between 0 and 1 [20]. When $\alpha = 1$, it becomes the absolute-value function, which is used in [35] and achieves good results.

    Figure 2.4. Residual extension by nonlinear distortion.

    2. Spectrum folding method. This time-domain method proposed in [20] is easy to

    implement. The upsampling of the narrow-band residual is done by inserting zeros

    instead of interpolating. This is equivalent to folding the spectrum of 0-4000Hz to

    4000-8000Hz in the frequency domain. Since the low-frequency spectrum is flat and

    has harmonics, the resulting wide-band residual will also have a flat spectrum and

    harmonics in both the low-frequency part and the high-frequency part. One drawback

    of this method is that the harmonic structure is broken at 4kHz. A possible solution is

    to change the sampling rate to a multiple of the pitch before performing the folding,

    but this requires accurate pitch detection. Another disadvantage is that harmonics in a

    real wide-band residual should have descending amplitudes, but in the folding

method, the harmonics at the highest frequencies are reflections of the harmonics at the lowest frequencies and therefore have the same amplitudes. Fortunately, these details of the residual signal do not affect speech intelligibility substantially, and the

    spectrum folding method is widely used.
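Both operations at the core of these two methods are simple enough to sketch in a few lines; the following numpy fragment (an illustration only, with the interpolation, whitening, and framing stages omitted, and with Equation 2.2 in the form reconstructed above) shows zero-insertion folding and the parametric nonlinearity:

    import numpy as np

    def fold_residual(e_nb: np.ndarray) -> np.ndarray:
        """Spectrum folding: upsample 8 kHz -> 16 kHz by inserting zeros.

        Zero insertion mirrors the 0-4 kHz residual spectrum into the
        4-8 kHz band, preserving its flatness and harmonic structure.
        """
        e_wb = np.zeros(2 * len(e_nb))
        e_wb[::2] = e_nb          # keep original samples, zeros in between
        return e_wb

    def distort(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
        """Nonlinear distortion of Eq. 2.2; alpha = 1 gives abs(x).

        Applied to an interpolated residual; the output still needs a
        whitening filter to flatten its spectrum.
        """
        return ((1.0 - alpha) * x + (1.0 + alpha) * np.abs(x)) / 2.0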

    2.3.3 Codebook Method

    Codebook mapping is a popular method to achieve spectral envelope extension [5][9][14].

    For this application, the codebook, as is shown in Figure 2.5, consists of two columns. The

    first column contains vectors composed of spectral parameters extracted from the narrow-

    band signal, while the second column contains vectors extracted from the corresponding

    wide-band signal. When an input frame comes in, the parameters are extracted from it and

    compared with the vectors in the first column. By vector quantization, the vector closest to

    the input parameters is found, and the corresponding wide-band parameters are taken from

    the second column to generate the extended spectral envelope.

    The codebook is generated from a large training database of speech. Eligible

    parameters to be used in the codebook are LPC coefficients, reflection coefficients, LSF,

    cepstral coefficients, etc.
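The per-frame lookup described above is a nearest-neighbor search over the first codebook column. A minimal numpy sketch, assuming the paired codebook columns have already been trained, is:

    import numpy as np

    def extend_envelope(x_nb: np.ndarray, cb_nb: np.ndarray,
                        cb_wb: np.ndarray) -> np.ndarray:
        """Codebook mapping for one frame.

        x_nb:  spectral-parameter vector of the narrow-band input frame.
        cb_nb: K x N_nb matrix, the narrow-band column of the codebook.
        cb_wb: K x N_wb matrix, the paired wide-band column.
        """
        d = np.sum((cb_nb - x_nb) ** 2, axis=1)  # squared distances to codewords
        return cb_wb[np.argmin(d)]               # wide-band entry of closest match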

    Figure 2.5. Envelope extension by codebook mapping.

An obvious limitation of the codebook mapping method is that the number of possible outputs is fixed by the codebook size. Also, when the parameters of two narrow-band

    frames belong to the same group in vector quantization, the corresponding wide-band frames

    do not necessarily belong to the same group. The probability of such mismatches increases

when the size of the codebook increases [21].

    To address the above problems, improved versions of codebook mapping were

    proposed:

    1. Codebook plus interpolation. When a set of input parameters comes in, a

    number of closest codebook items are found. The output is computed as the

    weighted sum of the corresponding wide-band parameter-vectors, based on a

    certain statistical model [5].

    2. Multiple codebooks. Speech frames are classified into several groups, and one

    codebook is trained and used separately for each group. In [9], two codebooks

    were trained and used for voiced and unvoiced frames, and the performance was

    found to be superior to other codebook methods.

    3. Statistical codebook searching. In order to reduce mismatching, when making

the decision on an upcoming frame, the information from a number of previous frames is taken into consideration to find the codebook item with the highest

    probability. In [14], a codebook search method was proposed, based on hidden

    Markov models (HMM).

    2.3.4 Linear Estimation Method

Instead of codebook mapping, spectral envelope extension can also be done by linear estimation. The vector $\vec{x}$, the set of parameters representing the narrow-band spectral envelope, is first extracted from an input signal frame. Then the corresponding vector $\vec{y}$ representing the wide-band envelope is calculated by feeding $\vec{x}$ into a group of linear filters. This is shown in the following equation:

$$\vec{y} = M \vec{x} \qquad (2.3)$$

where $M$ is the matrix composed of filter parameters. The output spectral envelope is then generated from the vector $\vec{y}$.

If we look at each element of $\vec{y}$ and the corresponding row of $M$, Equation 2.3 can be viewed as $N$ separate equations, as follows:

$$y(k) = \vec{w}(k)^{T} \vec{x}, \qquad k = 1, 2, \ldots, N \qquad (2.4)$$

where $N$ is the dimension of $\vec{y}$ and $\vec{x}$; $y(k)$ is the $k$th element of $\vec{y}$; $\vec{w}(k)$ is the $k$th row of $M$. To generate $M$ from a training database, a large number of $(\vec{x}, \vec{y})$ pairs are extracted, and the optimal $\vec{w}(k)$ is found by minimizing the following cost function:

$$E(k) = \sum_{i=1}^{L} \bigl( y(k,i) - \vec{w}(k)^{T} \vec{x}(i) \bigr)^{2}, \qquad k = 1, 2, \ldots, N \qquad (2.5)$$

where $L$ is the size of the training data; $\vec{x}(i)$ is the input vector $\vec{x}$ in the $i$th pair of training data; $y(k,i)$ is the $k$th element of the output vector $\vec{y}$ in the $i$th pair of training data.

From Wiener filtering theory, the optimal tap-weight vector is given below:

$$\vec{w}_{\mathrm{opt}}(k) = \left( X^{T} X \right)^{-1} X^{T} \vec{Y}(k), \qquad k = 1, 2, \ldots, N \qquad (2.6)$$

where $X$ and $Y$ are composed of the training data, with each row consisting of one sample of $\vec{x}$ or $\vec{y}$, respectively; $\vec{Y}(k)$ is the $k$th column of matrix $Y$.

Combining the $N$ equations in Equation (2.6), the final training solution is given below:

$$M = \left( X^{T} X \right)^{-1} X^{T} Y \qquad (2.7)$$
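In practice, the solution of Equation 2.7 can be computed with a standard least-squares routine rather than by forming the inverse explicitly. The sketch below assumes numpy and uses the row-vector convention of the training matrices; the data are random stand-ins for illustration only:

    import numpy as np

    def train_estimator(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Solve Eq. 2.7, M = (X^T X)^{-1} X^T Y, by least squares.

        X: L x N matrix whose rows are narrow-band parameter vectors.
        Y: L x N matrix whose rows are the paired wide-band vectors.
        """
        # lstsq minimizes ||X M - Y||^2 without an explicit matrix inverse.
        M, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return M

    # Illustration with random stand-in data (not real speech parameters).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))
    Y = rng.standard_normal((1000, 8))
    M = train_estimator(X, Y)
    y_hat = X[0] @ M    # wide-band estimate for one frame (row convention)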

The advantage of the linear estimation method is that it requires much less memory and computation than the codebook mapping method, which is a desirable property when implementing the system. One disadvantage is that the solution may yield invalid values representing an LPC filter with an unstable impulse response. Therefore, special adjustments have to be added to avoid invalid results, and this might introduce artifacts. Also, since a linear model is used to describe the nonlinear relation between narrow-band and wide-band parameters, a certain degree of distortion can be expected.

    To compensate for the nonlinearity, speech classification techniques are combined

    with linear estimation, as proposed in [23] and [6]. This is also known as piecewise linear

estimation. Speech frames are classified into several groups, and estimation matrices are trained and used for each group separately.

    The key factor in the linear estimation method is the choice of parameters to be used.

    Various speech parameters, such as LPC coefficients, LSF, and reflection coefficients, can

    represent the spectral envelope. In [9], a linear estimation using sub-band log energies is

    compared with a codebook mapping method, and shows higher spectral distortion, even with

    a classification into eight groups.

    LSFs are good candidates. Several linear estimation algorithms using LSFs have been

    proposed. In [21], the envelope extension by estimating LSFs was compared with a

    codebook-mapping counterpart using both an objective distortion measure and a subjective

evaluation, and showed better performance. In [6], the number of LSFs, i.e., the order of the LPC analysis, is doubled for the wide-band signal. The lower half of the expanded-signal LSFs is

    calculated by dividing the narrow-band LSF values by 2. This is equivalent to copying the

    narrow-band spectrum to the low-band of the output signal. Thus the transparency of the

system in the 300-3400Hz band is guaranteed.

    CHAPTER THREE

    BLUETOOTH-BASED PHONE ADAPTER

As discussed in Chapter 2, one of the main difficulties hearing-impaired people face in using telephones is the noise introduced by the interaction between the phone and the

    hearing aid. Thus phone adapters are needed to route the audio signal directly to the hearing

    aid or CI processor.

    This chapter proposes a wireless phone adapter, which falls into the category of FM

    assistive listening devices. This adapter is based on Bluetooth technology, which is a new

    wireless transmission standard. The favorable features of this technology make the adapter

    superior to traditional ALDs.

    3.1 Introduction to Bluetooth

    Wireless technology has dramatically changed the way people interact with one another and

    receive information. Bluetooth, a short-distance wireless communication standard, aims at

    replacing cables and therefore making the world truly wireless. It defines a universal radio

    interface, through which devices within 10 meters can form short-distance ad hoc networks.

    By using the dynamic Bluetooth links between these mobile devices, a large number of new

    products and services will become possible.

    The physical carrier of Bluetooth connections is the 2.4-2.5 GHz band, which is

    available for public use in most countries. This band is divided into 79 1-MHz-width

channels, and each channel is divided into 625-µs time slots. The modulation scheme

    used for one time slot of one channel is based on binary frequency shift keying (BFSK). A

    Bluetooth link, with one side called "the master device" and the other side called "the slave

    device", uses one channel in each time slot and jumps to another channel in the next time

    slot. The sequence of channels to be used is a pseudo-random sequence decided by the

    master device. The above scheme is called Frequency Hopping Code Division Multiple

    Access (FHCDMA). The device that initiates the ad hoc network becomes the master, while

    the other devices in the network are slaves. Two sides of a link alternately transmit and

    receive, i.e., there is only one-way traffic in one time slot. There are two kinds of links:

    synchronous connection-oriented (SCO) links and asynchronous connectionless (ACL) links.

    An SCO link is composed of evenly spaced pairs of time slots in the hopping sequence, with

    a 64Kb/s bit-rate; ACL links use the slots not reserved by SCO links. One Bluetooth link can

    support ACL links and up to 3 SCO links at the same time, and its theoretical maximum bit-

    rate is 1Mb/s [12][29].
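As a toy illustration of the hopping idea only (the actual selection algorithm in the Bluetooth baseband specification derives the sequence from the master's clock and device address, not from a generic random generator):

    import random

    # Master and slave share a pseudo-random sequence over the 79 channels,
    # advancing to a new channel every 625-us time slot.
    master_seed = 0x2A   # stand-in for the master's clock/address
    hop = random.Random(master_seed)
    sequence = [hop.randrange(79) for _ in range(10)]  # first 10 slots
    print(sequence)      # both sides compute the same channel list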

    To ensure interoperability between Bluetooth devices, protocol layers and application

    profiles are defined in the Bluetooth Specification. Figure 3.1 illustrates the structure of a

    Bluetooth protocol stack. The layers above and including HCI (Host Controller Interface) can

    be viewed as software layers, while the rest of the stack can be viewed as hardware layers.

    Baseband and Link Manager layers implement the transport actions described in the previous

    paragraph, and provide a command interface, the HCI layer. The L2CAP (Logical Link

    Control and Adaptation Protocol) layer divides large packets from higher layers into small

    packets for lower-layer transmission and reassembles received small packets into large

    packets intelligible to higher layers; the L2CAP layer also supports multiple applications by

    assigning logical channels. Thus, TCS (Telephony Control Specification), SDP (Service

    Discovery Protocol) and RFCOMM are unaware of physical communication details. The

    RFCOMM layer further emulates a serial port, so that many conventional applications can be

    used on it with no or minor changes. Application profiles for different scenarios are also

    included in the Bluetooth Specification to guide implementation, as two devices following

    the same profile have guaranteed compatibility [22].

    Figure 3.1. Structure of a Bluetooth stack.

TCS=Telephony Control Specification

SDP=Service Discovery Protocol

    L2CAP=Logical Link Control and Adaptation Protocol

    HCI=Host Controller Interface

    Currently available products equipped with Bluetooth technology include cellphones,

    phone adapters, headsets, PC cards, modems, printers, and printer adapters, which are mainly

    used to replace cables. A more important potential of Bluetooth is auto-synchronization.

    Personal mobile devices, such as cellphones, PDAs and laptops, can form a mobile network

    and always keep updated with each other. Wherever the user goes, the personal devices can

    automatically find the local Bluetooth-enabled devices and make use of the information and

services provided. It is projected that by 2005 there will be millions of Bluetooth-enabled products, and that the auto-synchronization function will bring great benefit to

    users. Places such as stores, cinemas and airports will only need a Bluetooth-enabled

    information access point to do all their business.

    The key factor for the success of this technology is interoperability. Manufacturers

    have to make their products comply with the Bluetooth Specification so that they can

    communicate with other products. A minimum requirement is to pass the qualification

    program administered by the Bluetooth Special Interest Group (SIG).

    3.2 Phone Adapter Design

    The phone adapter proposed in this thesis is an application of Bluetooth wireless technology

for hearing-impaired people. The Bluetooth link is used to transmit the audio signal. The

    architecture of the proposed adapter is illustrated in Figure 3.2.

    A pair of Bluetooth transceivers forms the wireless link, and each of them is

    connected to a host controller running the software protocol stack. The host controller can be

    a PC, a microcontroller, or a digital signal processor (DSP). The telephone signal is

    connected to the duplex audio interface of the master device. The audio output of the slave

    device is connected to the hearing aid or the CI processor, while the input is connected to a

    lapel microphone.

    Figure 3.2. Architecture of the phone adapter.

    The slave device is first initialized to active slave mode, waiting for the connection

    request from the master. When a telephone call comes in, or the user makes a call, the master

device sends out paging messages to find the slave device, and initiates an SCO link. After the

    connection is confirmed by both sides, the user can talk through this Bluetooth link without

    the need to hold the telephone handset or the need to connect the hearing aid or CI processors

    directly to the telephone jack. Since the audio signal is directly transmitted from the phone to

the hearing aid or CI processor, environmental noise is excluded, and the user will be able to enjoy high speech quality even in extremely noisy situations (e.g., in a crowded restaurant,

    in a car).

    3.3 Hardware Design

    A prototype system was developed in this thesis. In this prototype, a pair of Ericsson's

    Bluetooth Starter Kits (EBSK) was used for the transceiver hardware. The PC works as the

    host controller of the master device, which is stationary and connected to the telephone. A

    Motorola DSP56309 processor is used as the host controller of the slave device and provides

    portability of the user side.

    Figure 3.3. Hardware design.

    Figure 3.3 shows the block diagram of the hardware design of the slave device. The

    host I/O port of the DSP56309 Evaluation-Module (EVM) board is programmed to send HCI

    commands to EBSK. A sequence of 9 HCI protocol commands is implemented in this device

    in assembly language. Their function is to reset the EBSK, to set basic transmission settings,

    and to put the EBSK in an active slave mode. In order for this prototype to be stand-alone,

    the assembly program is written into the flash memory on the EVM board and is executed

    when the DSP is reset. To meet the electrical requirement of the host interface of the EBSK,

a signal amplifying and shifting circuit was designed based on an LM318N chip (National

Semiconductor), and a voltage converter circuit was built upon an LMC7660IN chip (National

    Semiconductor) to provide a negative voltage for the amplifier. Figure 3.4 shows a

    photograph of the portable slave-side device.
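For illustration, two HCI commands of the kind used in the initialization sequence described above are shown below as raw UART transport packets (packet type 0x01, 16-bit opcode in little-endian order, parameter length, then parameters); the exact nine-command sequence programmed into the DSP is not reproduced here:

    # HCI_Reset (opcode 0x0C03): restore the module's default state.
    HCI_RESET = bytes([0x01, 0x03, 0x0C, 0x00])

    # HCI_Write_Scan_Enable (opcode 0x0C1A) with parameter 0x03
    # (inquiry scan + page scan): makes the module discoverable and
    # connectable, i.e. able to wait as a slave for a master's page.
    HCI_WRITE_SCAN_ENABLE = bytes([0x01, 0x1A, 0x0C, 0x01, 0x03])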

    Figure 3.4. Phone adapter prototype.

    3.4 Software Design for Wireless Link

The ultimate goal of this project is not only a phone adapter, but an ALD that can

    receive audio signals from all Bluetooth-enabled sources, such as TVs, stereos, and

    computers. An audio source with a Bluetooth transceiver should be able to find this device

with a function description and set up an SCO link, all through the procedures defined in the

    Bluetooth Specification. To achieve this interoperability, the host controller of our device

    needs to support the L2CAP protocol, the RFCOMM protocol, the SDP protocol, and one

    application profile defined in the specification.

    The Headset Profile is most similar in function to our phone adapter. It defines the

    procedure of setting up a Bluetooth audio link between an audio-gateway device and a

    headset device. In our phone adapter, the telephone-side device corresponds to the audio

    gateway, and the user-side device corresponds to the headset. The transceiver hardware is

    still a pair of EBSK, but the host controllers of both sides are two computers with Ericsson

    Bluetooth PC Reference Stack loaded. This software stack is a COM-server (Component

    Object Model) in the form of an executable file. It contains HCI, L2CAP, RFCOMM, and

    SDP layers, and provides a programming interface [10]. Application programs, written in

    C++, communicate with the protocol layers by sending commands and receiving event-

    messages. These programs emulate the operations of the audio gateway and the headset, as

    defined in the Bluetooth Specification.

    Figure 3.5. Emulating the Headset Profile.

    The structure of the prototype system is shown in Figure 3.5. For practical reasons,

    the user-side device should be portable and cannot be hosted by a computer. Here we make

    the assumption that the software portion of this prototype can be implemented using a

    microcontroller or a DSP.

    Figure 3.6. Message flow (adapted from [10][29]).

    Figure 3.6 shows the message flow of setting up a headset link. The application

    program first initializes itself by registering at the protocol layers, and starts by writing the

SDP service record as a headset. When the remote audio gateway inquires about the function

    description, SDP answers with the information written by the program. After answering the

    L2CAP and RFCOMM connection requests properly, a virtual serial-port link is established

    between the two devices. "RING", an AT command defined in [13], is sent from the audio

    gateway to the headset, and the program answers with "AT+CKPD=200", another AT

command [11], which indicates that the incoming call is accepted. Then the audio gateway initiates the SCO link that carries the speech signal, and the setup is complete.
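On the headset side this exchange reduces to a line-oriented dialog on the emulated serial port. A sketch using pyserial is given below; the port name is hypothetical, and a real application must also handle the SDP, L2CAP, and RFCOMM steps described above:

    import serial  # pyserial; the RFCOMM link appears as a local serial port

    link = serial.Serial("COM5", 9600, timeout=30)  # port name is hypothetical
    line = link.readline().decode(errors="ignore").strip()
    if line == "RING":                # incoming-call alert from the gateway
        link.write(b"AT+CKPD=200\r")  # emulated button press: accept the call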

    The application programs were developed based on sample programs provided with

    the Ericsson stack. Those programs were modified for our application. When both host

    controllers used our emulating programs, the EBSKs completed the whole process shown in

    Figure 3.6.

    3.5 Testing

    In order to evaluate the effectiveness of the wireless transmission on the quality of audio

signal, three CI users were invited to talk through the phone-adapter prototype. All users were fitted with the MED-EL CIS-LINK processor. The three CI subjects used their daily

    MED-EL processor with the CIS strategy running at 1000-2000 pulses/second. The audio I/O

of the portable device was split into two mono-jacks, one leading to the audio input of the CI

    processor and the other leading to a microphone. The user listened through his CI processor

and talked into the microphone. The stationary side of the prototype was connected to different

    audio sources, including a handset with a person talking, a sound-card jack of a computer,

    and audio wires taken from a normal telephone.

    Good quality was reported by the CI users with a person talking through the handset,

    when both sides were within reasonable distance inside the lab (the lab is 7 meters long and 6

    meters wide). When the user, holding the portable device, walked outside the room and

closed the door (which has a metal frame), the signal faded substantially.

    In order to verify the interoperability of the software design, the user-side device (the

    virtual headset) was tested with an Ericsson T28W cellphone coupled to a DBA10 adapter,

    which supports the audio-gateway function in the Headset Profile. When the headset program

    was tested with T28W+DBA10, the process hung at a minor step before the "RING"

    message. The last message from the remote device was a modem-status-change event on the

    virtual connection, and that message was not answered properly. This could be due to

    compatibility problems between current Bluetooth products, and further work on this adapter

    might need technical support from the manufacturer.

    CHAPTER FOUR

    BANDWIDTH EXTENSION OF TELEPHONE SPEECH

    In Chapter 2, we have seen that both codebook-mapping algorithms and linear-estimation

    algorithms have their advantages and disadvantages, and that linear estimation requires much

    less memory and computation. This chapter proposes a linear estimation method based on

    LSF parameters, combined with speech classification techniques to overcome drawbacks of

    generic linear estimation. Section 4.1 provides a discussion about the algorithm proposed in

    [6], which motivated the proposed method. Section 4.2 provides a detailed description of the

    improved algorithm. Section 4.3 evaluates this algorithm using several objective measures.

    4.1 Linear Estimation Method for Bandwidth Extension

    In this section, we continue the discussion in Section 2.3. By analyzing the advantages and

    disadvantages of the linear estimation method proposed in [6], we provide the theoretical

    basis of the proposed algorithm of this thesis. The operation of linear estimation is shown in

    Equation 2.3, which is reproduced here:

$$\vec{y} = M \vec{x} \qquad (4.1)$$

where the vector $\vec{x}$ is the set of parameters representing the narrow-band spectral envelope; the vector $\vec{y}$ is the set of parameters representing the wide-band spectral envelope; $M$ is the matrix composed of estimation parameters.

    The choice of parameters is the critical factor in the linear-estimation performance.

    Among the parameters that can describe the spectral envelope, LSF is a good candidate.

Initially proposed and proven in [27][30], it is a set of parameters equivalent to the LPC coefficients. Its definition starts from the following functions:

$A(z) = 1 + a_1 z^{-1} + \cdots + a_M z^{-M}$

$B(z) = z^{-(M+1)} + a_1 z^{-M} + \cdots + a_M z^{-1}$    (4.2)

where $a_1, a_2, \ldots, a_M$ are the LPC coefficients, and $A(z)$ is the transfer function of the linear-prediction error filter. Further, let us define the following two functions:

$P(z) = A(z) - B(z)$

$Q(z) = A(z) + B(z)$    (4.3)

    Given P(z) and Q(z), we can calculate A(z), and hence the LPC coefficients, as follows:

$A(z) = \frac{P(z) + Q(z)}{2}$    (4.4)

    It was proven in [27] that P(z) and Q(z) can be factorized as follows:

If $M$ is an even number:

$P(z) = \left(1 - z^{-1}\right) \prod_{i=2,4,\ldots,M} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

$Q(z) = \left(1 + z^{-1}\right) \prod_{i=1,3,\ldots,M-1} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

If $M$ is an odd number:

$P(z) = \left(1 - z^{-2}\right) \prod_{i=2,4,\ldots,M-1} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

$Q(z) = \prod_{i=1,3,\ldots,M} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$    (4.5)

where $\omega_1 < \omega_2 < \cdots < \omega_M$ are defined as the line spectral frequencies (LSFs). The distribution of the LSF values describes the shape of the spectral envelope: a dense cluster of LSF values represents a high-magnitude

    portion of the spectrum; a scattered distribution represents the low magnitude portion; a close

    pair of LSF values represents a peak in the spectrum.
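To make this definition concrete, the LSFs can be computed by rooting P(z) and Q(z) and keeping the root angles on the upper unit circle. The following is a minimal sketch of this computation (assuming NumPy; the function name is illustrative, and the LPC vector is assumed to include the leading 1), not the exact implementation used in this thesis:

```python
import numpy as np

def lpc_to_lsf(a):
    """LSFs from LPC coefficients via Equations 4.2-4.5 (a sketch).
    a: LPC coefficients [1, a1, ..., aM] of A(z).  P(z) and Q(z) have
    roots on the unit circle; their angles in (0, pi) are the M LSFs."""
    b = np.concatenate(([0.0], a[::-1]))   # coefficients of B(z)
    a_pad = np.concatenate((a, [0.0]))     # A(z) padded to the same length
    p = a_pad - b                          # P(z) = A(z) - B(z)
    q = a_pad + b                          # Q(z) = A(z) + B(z)
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        angles.extend(w[(w > 1e-9) & (w < np.pi - 1e-9)])  # drop roots at 0, pi
    return np.sort(np.array(angles))
```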

LSF has some desirable properties: when the LSF values fall in the range $(0, \pi)$, the recovered LPC filter has guaranteed stability; local errors in LSF values cause only local

    spectral distortion. Therefore, linear estimation based on LSF values is more tolerant to

    estimation errors, as a single error cannot harm the whole spectral envelope. Linear-

    estimation algorithms using LSF were proposed in [6][21], and shown to yield superior

    performance to codebook methods.

    In [6], the high-band LSFs are estimated from low-band LSFs by the following

    equation:

$\mathbf{f}_h = \mathbf{f}_l A$    (4.6)

where $\mathbf{f}_h$ and $\mathbf{f}_l$ are $1 \times 8$ vectors of the high-band LSFs and the low-band LSFs respectively. The $8 \times 8$ matrix $A$ is calculated from training data by the following equation:

$A = \left(F_l^T F_l\right)^{-1} F_l^T F_h$    (4.7)

where the matrices $F_l$ and $F_h$ are obtained from training data. The rows of $F_l$ and $F_h$ consist of samples of $\mathbf{f}_l$ and $\mathbf{f}_h$ respectively. For each frame, the LSFs of the narrow-band signal are divided by 2 and then fed into Equation 4.6 as $\mathbf{f}_l$. The estimated $\mathbf{f}_h$ is combined with $\mathbf{f}_l$, and the whole set is used as wide-band LSFs to generate the wide-band spectral

    envelope. To improve the performance of linear estimation, speech frames are divided into 4

    groups based on the first two reflection coefficients, k1 and k2, as shown in the Table 4.1.

    Four matrices are trained and used separately for each group.

Speech class        Reflection coefficients
Class 1             k1 <= -0.7,  k2 <= 0.55
Class 2             k1 > -0.7,   k2 <= 0.55
Class 3             k1 <= -0.7,  k2 > 0.55
Class 4             k1 > -0.7,   k2 > 0.55

Table 4.1.--Classification of the [6] algorithm.

This algorithm makes the assumption that the frequency ranges $(0, \pi/2)$ and $(\pi/2, \pi)$ contain the same number of LSF values in wide-band speech. For example, when the LPC order is 24, there are always 12 LSFs in $(0, \pi/2)$ and 12 LSFs in $(\pi/2, \pi)$. This assumption is not true for all frames: in our training on the TIMIT database, the actual distribution was (11,13) or (13,11) with a probability of around 50%. In particular, the high-frequency consonants, which are of special interest to our problem, mainly have the distribution (11,13). This reflects the fact that their speech energy is concentrated in the high-band. To compensate for

this drawback, the proposed algorithm artificially disperses the LSF values in $\mathbf{f}_l$ of consonant frames before feeding it into the linear estimation. The classification component of the proposed algorithm provides a group of fricative-consonant frames, which is used as the criterion for applying this operation. The implementation can be found in Section 4.2.2.

Figure 4.1 shows the theoretical upper limit of the performance of the [6] algorithm, as the output signal is synthesized using the original wide-band spectral envelopes. The recovered consonants are weak due to the lack of energy in the narrow-band signal. When the original speech is transmitted over a telephone line, consonants lose most of their energy in the high-band. When the LPC analysis is done on a narrow-band consonant frame, this lack of energy is reflected in the residual signal. The residual of a consonant frame was found to be about 20dB lower in magnitude than that of a vowel frame. Therefore, an amplifying operation is

    needed for residual signals of consonant frames. The implementation can be found in Section

    4.2.1.

    Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence

    spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum folding.

Another shortcoming is the classification by fixed thresholds. In real life, different speakers have different distributions on the k1-k2 plane and therefore require different thresholds. Ideal classification criteria should adapt to the speaker by making use of the information from other frames in the same sentence. In this thesis, we propose a classification strategy based on Hidden Markov Models (HMM). This statistical method makes the thresholds soft, and is therefore more robust when working on speakers with various voice characteristics. The derivation and implementation can be found in Section 4.2.3.

    Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm.

    4.2 Proposed Algorithm for Bandwidth Extension

    This section describes the implementation details of the proposed algorithm. The overall

    system flow is shown in Figure 4.2.

    The narrow-band speech signal, with an 8kHz sampling rate, is processed on a frame-

    by-frame basis using a 20-ms Hanning window and a 10-ms overlap between adjacent

    frames. The windowed speech frame is analyzed using an LPC analyzer of order 12. The

    output of the analyzer goes to three branches:

    1. Residual extension. The narrow-band residual signal passes through a spectrum

    folding function where the sample rate is doubled and the low-band spectrum is

    copied to the high-band. Additional amplification is then applied to consonant

    frames, and the resulting sequence is sent out as the wide-band residual.

    2. Envelope extension. Narrow-band LPC coefficients are converted into 12

    narrow-band LSF values. They are divided by 2, pre-processed and then fed into

    the linear estimation, as shown in Equation 4.6. After eliminating invalid

    numbers, the estimated 12 high-band LSF values, together with the low-band LSF

    values, are converted back into order-24 wide-band LPC coefficients.

3. Classification of speech frames. Parameters are extracted from each narrow-band

speech frame. Using this information, a classification decision is made based on an

HMM, and this decision is used in both of the previous branches.

    Finally, the wide-band residual passes through the LPC synthesizer (of order 24)

    constructed by the wide-band LPC coefficients. The output is the desired wide-band speech,

    sampled at the rate of 16kHz, with recovered high-band information.

    Sections 4.2.1, 4.2.2 and 4.2.3 provide detailed description of these three branches

    respectively.

    Figure 4.3. Spectrum folding.

    4.2.1 Residual Extension

    The residual extension module implements the spectrum folding method proposed in [20].

    The sample rate is doubled from 8kHz to 16kHz. The odd-index sample values of the wide-

    band residual signal are copied from the narrow-band residual, and the even-index sample

    values are zeroed. This process is shown in the following equation:

$y(2k-1) = x(k), \qquad y(2k) = 0, \qquad k = 1, 2, \ldots, N$    (4.8)

    where y(n) is the wide-band residual and x(n) is the narrow-band residual. This time-domain

operation is equivalent to folding the (0, 4000) Hz spectrum to (4000, 8000) Hz in the frequency domain, as illustrated in Figure 4.3.

    Since the narrow-band residual has a flat spectrum, the spectrum is still flat after the

    folding. For unvoiced frames, the narrow-band residual is a noise-like sequence, and the

    wide-band residual inherits the same property; for voiced frames, the harmonic peaks in the

    narrow-band spectrum are copied to the high-band. As has been discussed in Section 2.2.2,

the harmonic structure is disrupted at 4kHz, but this drawback does not cause a perceivable artifact in the output speech.
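In code, the folding of Equation 4.8 amounts to zero insertion. A minimal sketch, assuming NumPy and 0-based indexing:

```python
import numpy as np

def fold_residual(x):
    """Spectrum folding (Equation 4.8): up-sample the narrow-band residual
    by inserting a zero after every sample.  In 0-based terms y[2k] = x[k]
    and y[2k+1] = 0; the zero insertion mirrors the (0, 4kHz) spectrum of
    x into the (4kHz, 8kHz) band of y."""
    y = np.zeros(2 * len(x))
    y[::2] = x
    return y
```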

    As discussed in Section 4.1, spectrum folding is not enough for consonant frames, as

    their residual signal needs to be amplified to the energy level of a real wide-band consonant.

    In order for the algorithm to be adaptive to different speakers, we use the average energy of

    previous vowel residuals in the same sentence as the reference energy level. The

    vowel/consonant decisions are provided by the classification branch. The amplification

    process is shown in the following equation:

$e'(n) = e(n) \left( \frac{\frac{1}{N} \sum_{i=1}^{N} \sum_{n} v_i^2(n)}{\sum_{n} e^2(n)} \right)^{0.8}$    (4.9)

where $e(n)$ is the consonant residual sequence, $v_i(n)$ is a vowel residual sequence, and $N$ is

    the number of previous vowel frames. The parameter 0.8 is an empirical number, which

    represents the trade-off between the need to amplify the signal and to maintain the energy

variation of different consonants. It was validated by experimental results, and examples can be found in Section 4.3.3.
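A sketch of the amplification step is given below (assuming NumPy; the function name and the bookkeeping of vowel energies are illustrative, and the gain follows Equation 4.9 as written above):

```python
import numpy as np

def amplify_consonant(e, vowel_energies, alpha=0.8):
    """Consonant-residual amplification (Equation 4.9).
    e: wide-band residual of the current consonant frame;
    vowel_energies: energies of the previous vowel residuals in the same
    sentence, whose mean is the reference level; alpha is the empirical
    0.8 trade-off exponent."""
    ref = np.mean(vowel_energies)    # (1/N) * sum of vowel residual energies
    cur = np.sum(e ** 2)             # energy of the consonant residual
    if cur == 0.0:
        return e                     # silent frame: nothing to amplify
    return e * (ref / cur) ** alpha
```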

    Figure 4.4. Envelope extension. Markers indicate LSF values.

    4.2.2 Envelope Extension

    The envelope extension module implements a linear estimation method using LSF. The

estimation and training equations are the same as in the [6] algorithm, as shown in Equations 4.6

    and 4.7 respectively. Figure 4.4 shows an example of the spectral envelopes before and after

the extension, together with the distribution of LSF values. Note that $\pi$ on the x-axis of the upper figure corresponds to 4kHz, while on the x-axis of the lower figure it corresponds to 8kHz.

    Among the 24 LSF values describing the wide-band LPC spectral envelope, the first

    12 LSFs come from the narrow-band signal and determine the envelope in the low-band.

They are calculated by dividing the narrow-band LSF values by 2; therefore their relative locations remain unchanged and they still describe the same envelope. This fact can be seen in Figure 4.4, where the first half of the lower envelope is a compressed copy of the upper envelope. The second half of the 24 LSF values is generated by the linear estimation, and determines how accurately the high-band information can be estimated. Estimation errors cause spectral distortion in the output speech.

As discussed in Section 4.1, high-frequency consonant frames mainly have the LSF distribution (11,13). Therefore, when we train the estimation matrix for the consonant group using Equation 4.7, most rows of the matrix $F_l$ contain a value exceeding $\pi/2$. But the vector $\mathbf{f}_l$ we feed into the estimation is composed of values limited to $(0, \pi/2)$, and this mismatch between working data and training data substantially degrades the performance. The results include invalid output LSF values exceeding $\pi$ and severe spectral distortion at the highest frequencies.

The solution proposed in our algorithm is to artificially disperse the LSF values in $\mathbf{f}_l$ for consonant frames, according to the following equation:

If $f_l(12) < 1.5$:  $\mathbf{f}_l \leftarrow 1.0825\, \mathbf{f}_l$    (4.10)

where 1.5 and 1.0825 are empirical parameters. This operation makes the maximum value in $\mathbf{f}_l$ approach or exceed $\pi/2$, emulating the situation in a real wide-band frame. As a by-

product, some distortion is introduced to the low-band: the magnitude is decreased and the spectral peaks are moved towards higher frequencies. Fortunately, this distortion is negligible for a consonant. Figure 4.5 compares the original wide-band envelope and the envelopes estimated with and without this operation, each with its distribution of LSF values.

    Figure 4.5. Effect of artificial dispersion.

The last step of the envelope extension module is eliminating invalid values. By definition, LSF values must fall in the range $(0, \pi)$. The possible errors in this linear estimation are values over $\pi$. In this case, all the wide-band LSFs are scaled down proportionally, as shown in the following equation:

If $v(24) > 3.05$:  $\mathbf{v} \leftarrow 3.05\, \mathbf{v} / v(24)$    (4.11)

where $\mathbf{v}$ is the vector of wide-band LSF values. The reason for setting the upper limit to 3.05 instead of $\pi$ is that an LSF very close to $\pi$ may cause whistling noise. The occurrence of invalid values means estimation failure, and the purpose of this last operation is to smooth such errors out. In fact, when a good classification technique is used, such errors are very rare, and they are not perceivable due to the overlap between adjacent frames.
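Putting the steps of this branch together, a minimal sketch is given below (assuming NumPy, LSF vectors sorted in increasing order, and a 12x12 estimation matrix M trained for the frame's class by Equation 4.7; the function name is illustrative):

```python
import numpy as np

def extend_envelope(lsf_nb, M, is_fricative):
    """Envelope extension for one frame (Equations 4.6, 4.10, 4.11).
    lsf_nb: the 12 narrow-band LSFs in (0, pi); M: 12x12 estimation
    matrix for this frame's class; is_fricative: HMM-branch decision."""
    f_l = lsf_nb / 2.0                   # narrow band maps into (0, pi/2)
    if is_fricative and f_l[11] < 1.5:   # artificial dispersion (Eq. 4.10)
        f_l = 1.0825 * f_l
    f_h = f_l @ M                        # linear estimation (Eq. 4.6)
    v = np.concatenate((f_l, f_h))       # 24 wide-band LSF candidates
    if v[23] > 3.05:                     # eliminate invalid values (Eq. 4.11)
        v = 3.05 * v / v[23]
    return v                             # converted to order-24 LPC afterwards
```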

    4.2.3 Classification of Speech Frames

    From Sections 4.2.1 and 4.2.2, we have seen that classification decisions are used in training,

    estimation-matrix choice, residual adjustment, and the pre-processing of linear estimation.

    The accuracy of this classification of speech frames is critical to the performance of the

    whole algorithm.

    The proposed classification strategy in this thesis is based on a Hidden Markov

    Model (HMM) [26]. Speech frames are divided into four classes: class 1 includes vowels and

    semi-vowel consonants; class 2 includes nasal consonants; class 3 includes silence and weak

    consonants (stops); class 4 includes fricative consonants. Class 4 is of special interest in

    bandwidth extension. The definition of the four classes is given below in terms of the

    phonemes of each class (TIMIT phonetic symbols are used):

    Class 1={'aa' 'aw' 'ay' 'ah' 'ao' 'oy' 'ow' 'uh' 'l' 'r' 'w' 'el' 'iy' 'ih' 'eh' 'ey' 'ae' 'uw' 'ux'

    'er' 'ax' 'ix' 'axr' 'ax-h' 'b' 'dx' 'q' 'v' 'dh' 'y'}

    Class 2={'m' 'n' 'ng' 'em' 'en' 'eng' 'nx'}

    Class 3={'g' 'p' 'k' 'hh' 'hv' 'pau' 'epi' 'h#' 'tcl' 'kcl' 'bcl' 'gcl' 'pcl' 'dcl'}

    Class 4={'jh' 'ch' 'sh' 'zh' 'd' 't' 's' 'z' 'f' 'th'}

Since each speech frame belongs to one of the four classes, a sentence can be viewed as a sequence of states, with each state indicating the class of the current frame. This state sequence is hidden, and the purpose of our classification strategy is to recover it from information extracted from the speech signal.

    Three parameters are chosen as the basis of classification. The first parameter is k1,

    the first reflection coefficient of the speech frame. The second parameter is the zero-crossing

    rate of the speech frame, defined as follows:

$Z = \frac{1}{2N} \sum_{m=1}^{N} \left| \operatorname{sign}(x(m)) - \operatorname{sign}(x(m-1)) \right|$    (4.12)

where sign(·) denotes the sign function and $x(m)$ are the samples of the frame. The third parameter is the frame energy:

$E = \sum_{m=1}^{N} x^2(m)$    (4.13)

With the observation $O_t = \{k_1, E, Z\}$ extracted from frame $t$, the classification problem is to find the most likely state sequence given the observations:

$\{q_1, q_2, \ldots, q_T\} = \arg\max_{q_1, \ldots, q_T \in \{1,2,3,4\}} P[q_1, \ldots, q_T \mid O_1, \ldots, O_T]$    (4.14)

    where T is the number of frames in the sentence. Further considering

$P[q_1, \ldots, q_T \mid O_1, \ldots, O_T] = \frac{P[q_1, \ldots, q_T, O_1, \ldots, O_T]}{P[O_1, \ldots, O_T]}$    (4.15)

    where the denominator is a constant for a given sentence, an alternative expression of the

    classification problem is given by the following equation.

$\{q_1, q_2, \ldots, q_T\} = \arg\max_{q_1, \ldots, q_T \in \{1,2,3,4\}} P[q_1, \ldots, q_T, O_1, \ldots, O_T]$    (4.16)

In order to evaluate the probability in Equation 4.16, we need to build a statistical model describing the random processes $\{q_1, q_2, \ldots, q_T\}$ and $\{O_1, O_2, \ldots, O_T\}$. This HMM is specified by the following parameters [26]:

    1. The initial state distributions:

$\pi_i = P[q_1 = i], \qquad i = 1,2,3,4$    (4.17)

    2. The state transition probabilities:

$a_{ij} = P[q_{t+1} = j \mid q_t = i], \qquad i, j \in \{1,2,3,4\}$    (4.18)

    3. The observation distributions in each state (here we use a discrete distribution):

$b_i(O) = P[O_t = O \mid q_t = i], \qquad i = 1,2,3,4$    (4.19)

    These parameters are calculated from training sentences in the TIMIT database, and

    the phonetic description files in TIMIT are used to provide the real classification decisions in

    training. More details about training will be given later.

    After the model is built, i.e., after the parameters are calculated, the solution of

    Equation 4.16 can be computed by the Viterbi algorithm, which is computationally efficient.

First we define a quantity $\delta_t(i)$ as follows:

$\delta_t(i) = \max_{q_1, \ldots, q_{t-1} \in \{1,2,3,4\}} P[q_1, \ldots, q_{t-1}, q_t = i, O_1, O_2, \ldots, O_t], \qquad i = 1,2,3,4$    (4.20)

    The Viterbi procedure to find the best state sequence is given below [26]:

1. Initialization:

$\delta_1(i) = \pi_i\, b_i(O_1), \qquad i = 1,2,3,4$    (4.21)

$\psi_1(i) = 0, \qquad i = 1,2,3,4$    (4.22)

2. Recursion (for t = 2, 3, ..., T):

$\delta_t(j) = \max_{i=1,2,3,4} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \qquad j = 1,2,3,4$    (4.23)

$\psi_t(j) = \arg\max_{i=1,2,3,4} \left[\delta_{t-1}(i)\, a_{ij}\right], \qquad j = 1,2,3,4$    (4.24)

3. Termination:

$q_T = \arg\max_{i=1,2,3,4} \left[\delta_T(i)\right]$    (4.25)

4. Backtracking:

$q_t = \psi_{t+1}(q_{t+1}), \qquad t = T-1, T-2, \ldots, 1$    (4.26)

In practice, all the multiplications above are implemented as additions in the log domain, because direct multiplication of a large number of small values can underflow the numerical resolution of any computer. The computational complexity of this procedure is on the order of 16T operations (T is the number of frames in the sentence), in addition to the extraction of {k1, Z, E} for each frame. This complexity is much higher than that of the fixed-threshold classification in Table 4.1, but is still feasible in real time. However, a real-time implementation cannot make use of the whole sentence. In practice, we have found that the performance of the Viterbi algorithm based on 30 frames is close to that based on a whole sentence, and therefore a real-time implementation can use 30 frames as the decision size. The delay caused by this algorithm is then 15 × 20ms = 0.3 seconds, which is still acceptable.
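A log-domain sketch of the procedure follows (assuming NumPy; the per-frame observation probabilities are precomputed, and the function name is illustrative):

```python
import numpy as np

def viterbi_log(log_pi, log_a, log_b):
    """Log-domain Viterbi decoding (Equations 4.21-4.26).
    log_pi: (4,) log initial probabilities; log_a: (4,4) log transition
    probabilities; log_b: (T,4) log observation probabilities per frame."""
    T, S = log_b.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_b[0]                      # initialization (4.21)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a        # delta_{t-1}(i) + log a_ij
        psi[t] = np.argmax(scores, axis=0)            # best predecessor (4.24)
        delta[t] = np.max(scores, axis=0) + log_b[t]  # recursion (4.23)
    q = np.empty(T, dtype=int)
    q[-1] = np.argmax(delta[-1])                      # termination (4.25)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]                   # backtracking (4.26)
    return q                                          # 0-based class indices
```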

The costs discussed in the previous paragraph are outweighed by the robustness the new classification strategy brings to the bandwidth-extension algorithm. This robustness shows when dealing with abnormal frames and abnormal speakers. A single frame with unusual parameters can still be assigned to the correct group, because the decision for that frame takes nearby frames into consideration. Classification techniques based on fixed thresholds also suffer when used on a speaker whose parameter distribution differs from the training data. The proposed classification technique, on the other hand, has soft thresholds and is more tolerant to such abnormal input.

The observation distribution in each state is a three-dimensional discrete distribution, defined as follows:

$b_i(O) = P[O_t = O \mid q_t = i] = P[\{k_1, E, Z\} \mid q_t = i] = P[\{x, y, z\} \mid q_t = i], \qquad i = 1,2,3,4$    (4.27)

where $\{x, y, z\}$ is the result of quantizing $\{k_1, E, Z\}$ into discrete bins (Equation 4.28).

The trained distributions are smoothed by averaging each bin with its immediate neighbors:

$b_i(x,y,z) \leftarrow \frac{b_i(x,y,z)}{2} + \frac{b_i(x-1,y,z) + b_i(x+1,y,z) + b_i(x,y-1,z) + b_i(x,y+1,z)}{12} + \frac{b_i(x,y,z-1) + b_i(x,y,z+1)}{12}$

$i = 1,2,3,4; \qquad x, y, z \in \{2, 3, \ldots, 34\}$    (4.29)
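A sketch of this smoothing, assuming each state's distribution is stored as a 35x35x35 NumPy histogram (0-based bins 0-34, so the updated interior bins correspond to x, y, z in {2, ..., 34}):

```python
import numpy as np

def smooth_distribution(b):
    """Neighbor smoothing of one state's observation histogram (Eq. 4.29).
    b: (35, 35, 35) array of bin probabilities; only interior bins change."""
    s = b.copy()
    c = b[1:-1, 1:-1, 1:-1]
    s[1:-1, 1:-1, 1:-1] = (
        c / 2.0
        + (b[:-2, 1:-1, 1:-1] + b[2:, 1:-1, 1:-1]
           + b[1:-1, :-2, 1:-1] + b[1:-1, 2:, 1:-1]) / 12.0
        + (b[1:-1, 1:-1, :-2] + b[1:-1, 1:-1, 2:]) / 12.0
    )
    return s
```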

    4.3 Evaluation and Results

    Subjective and objective measures are commonly used for evaluating speech-processing

    strategies. To evaluate this bandwidth extension algorithm, the ideal criterion is obviously a

    subjective test, in which hearing-impaired people listen to the narrow-band speech and

recovered wide-band speech, and make judgments about the quality. But in a subjective test,

    the result can be affected by many factors such as the degree of hearing loss, users'

    experience in using telephones, and the choice of sentences. The results of a subjective test

    often cannot be reproduced and therefore are not suitable for comparing different signal-

    processing algorithms.

    Objective measures, on the other hand, can be clearly defined and easily repeated.

    Two algorithms can be compared fairly by calculating an objective measure using the same

    speech database. However, the results of objective measures are often not highly correlated

with those of subjective measures. When a signal is said to have the lowest distortion according to an objective test, it may not have the highest intelligibility and naturalness, nor

    be preferred by human listeners. We still do not fully understand how human ears process

    speech signals and therefore cannot define the optimal objective measure to model speech

    quality.

    Among current objective measures, the Itakura-Saito (IS) spectral distance and Log

    Likelihood Ratio (LLR) are widely used, as they have relatively modest correlation with

speech intelligibility [25]. We use both of them to evaluate the proposed bandwidth-extension algorithm. Furthermore, a frequency-domain signal-to-noise ratio (SNR) is defined to evaluate the envelope-extension performance, and an accuracy measure is defined to evaluate the classification performance. The definitions and results of these four measures are provided in Section 4.3.2. Sentence examples are presented in Section 4.3.3.

    4.3.1 Test Material

    The TIMIT database is a standard speech database widely used by speech-processing

    researchers, and it is the source of all the training sentences and testing sentences in this

    thesis [18]. The original sentences are all wide-band speech with a 16kHz sampling rate. The

    TIMIT sentences were low-pass filtered and then down-sampled to generate the narrow-band

    TIMIT sentences (4kHz bandwidth). Each objective measure is calculated for male-speaker

    sentences, female-speaker sentences, and mixed sentences. In the male or female case, the

    training data is 250 sentences, and the testing data is 10 sentences; in the mixed case, the

    training data is 250 male sentences and 250 female sentences, and the testing data is 10 male

    sentences and 10 female sentences. There is no overlap between the training data and the

    testing data.

    4.3.2 Objective Measures

The four objective measures are discussed below, followed by the test results.

Itakura-Saito (IS) distance measure

IS and LLR are spectral distance measures based on the all-pole LPC model. They

    compare the original signal and the distorted signal on a frame-by-frame basis. The IS

    distance is the most widely used measure of spectral distortion, as it takes into

    consideration both the spectral envelope and the frame energy. The IS distance of one

    frame is computed as follows:

$d_{IS} = \frac{\mathbf{a}_y^T R_x \mathbf{a}_y}{\mathbf{a}_y^T R_y \mathbf{a}_y} + \log_{10}\left(\frac{\mathbf{a}_y^T R_y \mathbf{a}_y}{\mathbf{a}_x^T R_x \mathbf{a}_x}\right) - 1$    (4.30)

where $\mathbf{a}_x$ and $R_x$ are the linear prediction coefficient vector and the autocorrelation matrix of the original speech frame respectively; $\mathbf{a}_y$ and $R_y$ are the linear prediction coefficient vector and the autocorrelation matrix of the estimated speech frame to be evaluated.

    Since the IS measure changes when the energy of the target sentence changes, the

    original wide-band sentence and the estimated wide-band sentence are first normalized to

    have the same total energy. Then IS distance values are calculated for each pair of 20ms

    frames taken from the two signals. The results are averaged over all the frames, with the

    highest 10% of the IS distance values discarded to smooth out meaningless large

    numbers. The average IS distance is calculated as follows:

$\bar{d}_{IS} = \frac{1}{0.9N} \sum_{\text{lower } 90\%} d_{IS}$    (4.31)

    where N is the number of frames.
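A sketch of the per-frame computation and the trimmed averaging (assuming NumPy; the LPC vectors include the leading 1 and the R matrices are the Toeplitz autocorrelation matrices):

```python
import numpy as np

def is_distance(a_x, a_y, R_x, R_y):
    """Per-frame Itakura-Saito distance (Equation 4.30)."""
    ex = a_x @ R_x @ a_x          # optimal prediction-error energy, original
    ey = a_y @ R_y @ a_y          # prediction-error energy, estimate
    return (a_y @ R_x @ a_y) / ey + np.log10(ey / ex) - 1.0

def average_is(d_frames):
    """Average IS distance (Equation 4.31): drop the highest 10% of the
    per-frame values and average the lower 90%."""
    d = np.sort(np.asarray(d_frames))
    return d[: int(0.9 * len(d))].mean()
```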

    Figure 4.6 shows the performance comparison between the [6] algorithm and the

proposed algorithm. "A" denotes the [6] algorithm; "C" denotes the proposed algorithm in this thesis, with the HMM classification strategy; "B" denotes the proposed algorithm without the HMM part, using the adjusted classification thresholds given in Table 4.2. In all cases, the average IS distance is calculated over all testing sentences. The two versions of the proposed algorithm show consistently lower speech distortion values, which correspond to higher speech quality.

    Figure 4.6. IS distance comparison of the algorithms.

Speech class        Reflection coefficients
Class 1             k1 <= -0.7,  k2 <= 0.55
Class 2             k1 > -0.7,   k2 <= 0.55
Class 3             k1 <= -0.3,  k2 > 0.55
Class 4             k1 > -0.3,   k2 > 0.55

Table 4.2.--Adjusted thresholds used in "B".

Log Likelihood Ratio (LLR)

As the origin of the IS measure, LLR involves only the distortion of the LPC spectral envelope

    and has a much clearer physical significance. The LLR measure of one speech frame is

    computed as follows:

$d_{LLR} = \log_{10}\left(\frac{\mathbf{a}_y^T R_x \mathbf{a}_y}{\mathbf{a}_x^T R_x \mathbf{a}_x}\right)$    (4.32)

where $\mathbf{a}_x$ and $R_x$ are the linear prediction coefficient vector and the autocorrelation matrix of the original speech frame respectively; $\mathbf{a}_y$ is the linear prediction coefficient vector of the estimated speech frame to be evaluated. In the time domain, the denominator can be viewed as the optimal prediction-error energy, and the numerator as the prediction-error energy obtained using the estimated LPC coefficients. From Wiener filtering theory, the denominator is the minimum possible error energy, achieved only with the true LPC coefficients. Therefore, the numerator is always at least as large as the denominator, and the LLR value is always non-negative. The larger the LLR value, the more the estimated LPC coefficients differ from the real ones in the error-energy sense. In the frequency domain, the LLR can be

    reformulated as follows [25]:

$d_{LLR} = \log_{10}\left(1 + \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\frac{A_x(e^{j\omega}) - A_y(e^{j\omega})}{A_x(e^{j\omega})}\right|^2 d\omega\right)$    (4.33)

where $A_x(e^{j\omega})$ and $A_y(e^{j\omega})$ are the LPC spectra of the original speech frame and the generated speech frame respectively. Equation 4.33 can be viewed as a weighted sum of the spectral envelope distortion at all frequencies, with high weighting put on the formant

    frequencies of the original signal. Therefore, LLR mainly models the mismatch between

    the formants of the two speech frames.

    LLR values are calculated for each pair of 20ms frames taken from the two signals.

    The results are averaged over all the frames, with the highest 10% of the LLR values

    discarded to smooth out meaningless large numbers. The average LLR is calculated as

    follows:

$\bar{d}_{LLR} = \frac{1}{0.9N} \sum_{\text{lower } 90\%} d_{LLR}$    (4.34)

    where N is the number of frames.
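Given the same quantities as the IS measure, the per-frame computation is a one-liner; a sketch:

```python
import numpy as np

def llr(a_x, a_y, R_x):
    """Per-frame Log Likelihood Ratio (Equation 4.32): error energy of the
    estimated coefficients on the original frame over the optimal energy."""
    return np.log10((a_y @ R_x @ a_y) / (a_x @ R_x @ a_x))
```

The highest 10% of the per-frame values are then discarded exactly as for the IS measure.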

    Figure 4.7. LLR measure comparison of the algorithms.

  • 8/3/2019 Haifeng Ms Thesis

    64/78

    54

Figure 4.7 gives the bar plots of the average LLR values of the [6] algorithm and the proposed algorithm for the male, female, and mixed cases. The proposed algorithm shows consistently lower speech distortion.

Frequency-domain signal-to-noise ratio (SNR)

The frame-based segmental SNR is another popular method for evaluating speech

    quality. It is defined as the ratio of the signal energy to the noise energy in decibels.

    Because the phase information is lost when the bandwidth-extension algorithm processes

    the residual signal, the time-domain SNR is not suitable for measuring the performance.

    As an alternative, we propose the frequency-domain SNR (denoted as SNRf), which is

    defined in the following equation:

$SNR_f = 10\log_{10}\left(\frac{\int_{-\pi}^{\pi} \left|A_x(e^{j\omega})\right|^2 d\omega}{\int_{-\pi}^{\pi} \left|A_x(e^{j\omega}) - A_y(e^{j\omega})\right|^2 d\omega}\right)$    (4.35)

where $A_x(e^{j\omega})$ and $A_y(e^{j\omega})$ are the LPC spectra of the original speech frame and the estimated speech frame respectively. The frequency-domain SNR represents the distortion of the LPC spectral envelope. The difference between this measure and LLR is that with the SNRf measure the distortions at all frequency components are treated equally.

    SNRf values are calculated for each pair of 20ms frames taken from the two signals.

    The results are averaged over all the frames, and the average SNRf is calculated as

    follows:

$\overline{SNR_f} = \frac{1}{N} \sum_{n=1}^{N} SNR_f(n)$    (4.36)

    where N is the number of frames.
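A sketch, with the integrals of Equation 4.35 replaced by sums over a uniform frequency grid (env_x and env_y are sampled magnitudes of the two LPC spectra; these names are illustrative):

```python
import numpy as np

def snr_f(env_x, env_y):
    """Frequency-domain SNR of one frame (Equation 4.35)."""
    return 10.0 * np.log10(np.sum(env_x ** 2) /
                           np.sum((env_x - env_y) ** 2))
```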

    Figure 4.8. Frequency-domain SNR comparison of the algorithms.

    Figure 4.8 shows the bar chart comparing the frequency-domain SNR values of the

    estimated wide-band speech produced by the two algorithms. The proposed algorithm

shows a 2dB SNRf gain over the [6] algorithm. This is attributed to the improvement in the

    envelope extension branch of the algorithm.

Classification accuracy measure

As has been discussed in Section 4.2, the accuracy of the classification decisions is

    critical to the performance of the whole algorithm, and the accuracy of identifying

    fricative-consonant frames is particularly important. Therefore, when calculating this

    measure, we divide speech frames into only 2 groups: fricative-consonant frames and

    other frames. The accuracy is defined as t