
    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    APPROVED BY SUPERVISORY COMMITTEE:

    Dr. Philipos Loizou, Chair

    Dr. Andrea Fumagalli

    Dr. Murat Torlak

    Copyright 2002

    Haifeng Qian

    All Rights Reserved

    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    by

    HAIFENG QIAN

    THESIS

    Presented to the Faculty of

    The University of Texas at Dallas

    in Partial Fulfillment

    of the Requirements

    for the Degree of

    MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

    THE UNIVERSITY OF TEXAS AT DALLAS

    May 2002

    ACKNOWLEDGEMENTS

    I would like to thank my adviser, Dr. Philipos Loizou, for his guidance in my research. He

    has offered me many helpful suggestions throughout my two-year graduate study.

    I would also like to thank Dr. Andrea Fumagalli and Dr. Murat Torlak, for their valuable

    feedback on this manuscript.

    Thanks also go to my coworkers in the Speech Processing Lab, for their cooperation and

    friendship. It has been my pleasure to work with them. Dr. Oguz Poroy, a former member of

    the lab, helped me with the hardware building in this thesis.

    I would also like to take this opportunity to thank National Institutes of Health for supporting

    the research under Grant R01 DC03421.

    BLUETOOTH RECEIVER AND BANDWIDTH-EXTENSION ALGORITHMS FOR

    TELEPHONE-ASSISTIVE APPLICATIONS

    Haifeng Qian, M.S.E.E.

    The University of Texas at Dallas, 2002

    Supervising Professor: Dr. Philipos C. Loizou

    This thesis addresses the problem of helping hearing-impaired people to use telephones.

    There are two aspects of this work: a Bluetooth-based wireless phone adapter and a

bandwidth-extension algorithm. Built upon Bluetooth technology, the proposed phone adapter routes the telephone audio signal to the hearing aid or the CI processor wirelessly, and hence eliminates environmental noise and interference. The proposed bandwidth-extension algorithm has the potential to increase speech intelligibility for hearing-impaired people by estimating a wide-band signal from the narrow-band telephone signal. It combines piecewise linear estimation based on line spectral frequencies with a statistical speech-frame classification technique based on hidden Markov models, which overcomes a drawback of conventional bandwidth-extension algorithms. The phone adapter was tested by

    CI users, and the proposed algorithm was evaluated by objective measures. Both results

    showed good performance.

    TABLE OF CONTENTS

    ACKNOWLEDGEMENTS..................................................................................................... iv

    ABSTRACT.............................................................................................................................. v

    LIST OF FIGURES................................................................................................................viii

    LIST OF TABLES.................................................................................................................... x

    1. INTRODUCTION................................................................................................................ 1

    2. LITERATURE REVIEW..................................................................................................... 3

    2.1 Assistive listening devices ............................................................................................ 3

    2.1.1 Hardwired devices................................................................................................ 5

    2.1.2 Induction loop devices ......................................................................................... 6

    2.1.3 Frequency modulation devices............................................................................. 6

2.1.4 Infrared light devices ......................................................................... 7

    2.2 Telephone recognition by CI users ............................................................................... 8

    2.3 Speech enhancement by bandwidth extension............................................................ 11

    2.3.1 Fundamentals of bandwidth extension............................................................... 12

    2.3.2 Residual extension ............................................................................................. 13

    2.3.3 Codebook method .............................................................................................. 15

    2.3.4 Linear estimation method................................................................................... 17

    3. BLUETOOTH-BASED PHONE ADAPTER.................................................................... 20

    3.1 Introduction to Bluetooth ............................................................................................ 20

3.2 Phone adapter design................................................................... 23

3.3 Hardware design.......................................................................... 24

    3.4 Software design for wireless link................................................................................ 26

    3.5 Testing......................................................................................................................... 29

    4. BANDWIDTH EXTENSION OF TELEPHONE SPEECH.............................................. 31

    4.1 Linear estimation method for bandwidth extension.................................................... 31

    4.2 Proposed algorithm for bandwidth extension ............................................................. 37

    4.2.1 Residual extension ............................................................................................. 38

    4.2.2 Envelope extension ............................................................................................ 40

    4.2.3 Classification of speech frames.......................................................................... 43

    4.3 Evaluation and results ................................................................................................. 48

    4.3.1 Test material....................................................................................................... 49

    4.3.2 Objective measures ............................................................................................ 49

    4.3.3 Examples............................................................................................................ 57

    5. CONCLUSIONS................................................................................................................ 60

    BIBLIOGRAPHY................................................................................................................... 63

    VITA

    LIST OF FIGURES

    Figure 2.1. Three components of an ALD ................................................................................ 4

    Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech. ..... 11

    Figure 2.3. Architecture of bandwidth extension systems...................................................... 12

    Figure 2.4. Residual extension by nonlinear distortion........................................................... 14

    Figure 2.5. Envelope extension by codebook mapping. ......................................................... 16

    Figure 3.1. Structure of a Bluetooth stack............................................................................... 22

    Figure 3.2. Architecture of the phone adapter......................................................................... 24

    Figure 3.3. Hardware design. .................................................................................................. 25

    Figure 3.4. Phone adapter prototype. ...................................................................................... 26

    Figure 3.5. Emulating the Headset Profile.............................................................................. 27

    Figure 3.6. Message flow. ....................................................................................................... 28

    Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence

    spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum

    folding. .............................................................................................................................. 35

    Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm. ................. 36

    Figure 4.3. Spectrum folding. ................................................................................................. 38

    Figure 4.4. Envelope extension. Markers indicate LSF values............................................... 40

    Figure 4.5. Effect of artificial dispersion. ............................................................................... 42

    Figure 4.6. IS distance comparison of the algorithms............................................................. 51

    Figure 4.7. LLR measure comparison of the algorithms. ....................................................... 53

    Figure 4.8. Frequency-domain SNR comparison of the algorithms. ...................................... 55

    Figure 4.9. Classification performance. .................................................................................. 56

    Figure 4.10. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated

    wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed

    algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with

    HMM................................................................................................................................. 58

    Figure 4.11. Comparison of spectrograms. (a) Original wide-band speech. (b) Estimated

    wide-band speech by [6] algorithm. (c) Estimated wide-band speech by proposed

    algorithm without HMM. (d) Estimated wide-band speech by proposed algorithm with

    HMM................................................................................................................................. 59

    LIST OF TABLES

    Table 4.1.--Classification of [6] algorithm. ............................................................................ 34

    Table 4.2.--Adjusted thresholds used in B. ......................................................................... 51

    CHAPTER ONE

    INTRODUCTION

    Hearing-impaired people, including hearing aid users and cochlear implant (CI) users, often

    have difficulty talking through telephones. The intelligibility of telephone speech is

    considerably lower than the intelligibility of person-to-person speech. This degradation

    results mainly from the following three factors:

    1. Lack of visual cues. In a person-to-person conversation, a hearing aid or CI user

often uses lip-reading or other visible cues to help understand the other person.

    When talking on the phone, the audio signal is the only information source that he

    can make use of.

    2. Loss of high-frequency information. Telephone speech is band-limited to 300Hz-

    3400Hz. The spectrum above 3.4kHz, present primarily in fricative consonants

such as 's', 'sh', and 'ts', is lost. This results in the muffled quality of telephone sound, which does not affect normal-hearing people but greatly affects the

    hearing impaired.

    3. Additional noise introduced by the interaction between the phone and the hearing

aid. The electromagnetic coupling between the phone-handset circuit and the hearing aid coil results in feedback and amplification of background noise

    [33]. A cellphone's electromagnetic emission is often picked up by a hearing aid

    as a buzzing noise [3]. Also, the performance of a CI decreases when using

    cellphones, and different CI processors are not compatible with certain kinds of

    cellphones [28].

    To address the third problem, phone adapters have been proposed to help the hearing

    impaired in telephone conversation. They are one category of assistive listening devices that

    route the audio signal to the hearing aid or CI and hence maximize the signal-to-noise ratio

    (SNR) [32][34]. In this thesis, we propose a wireless phone adapter based on Bluetooth

technology, a recently introduced short-range wireless transmission standard.

    To address the second problem, algorithms can be designed to process telephone

speech to improve intelligibility [6][14][33]. Effort has been put into bandwidth extension

    techniques that aim at recovering the lost consonants from the narrow-band telephone

    speech. In this thesis, we propose a linear estimation method based on line spectral

    frequencies (LSF).

    This thesis is organized as follows: Chapter 2 is a review on current assistive listening

    devices and bandwidth extension methods that have been developed; Chapter 3 proposes a

    Bluetooth-based phone adapter to address the third problem; Chapter 4 proposes a bandwidth

    extension algorithm to solve the second problem; Chapter 5 presents conclusions and future

    work.

    CHAPTER TWO

    LITERATURE REVIEW

    In this chapter, we provide a literature review on assistive listening devices (ALD), telephone

    recognition by CI users, and speech bandwidth extension algorithms.

ALDs are a general category of devices used to help hearing-impaired people in

    different applications. They have been developed using all kinds of modern technologies [3]

    [19]. Phone adapters are a special group of ALDs that send the telephone audio directly to

    the hearing aid or CI in order to minimize exposure to environmental noise.

    Telephone usage is one of the major concerns of CI users. Studies [8][28] have shown

    that a certain percentage of CI users do not feel comfortable using the telephone. Their

    conversation quality is limited by various problems [16], and improvements are needed.

    Bandwidth extension algorithms try to improve the general intelligibility by doubling

    the sample rate of telephone speech to recover the lost high-frequency information.

    Sections 2.1, 2.2 and 2.3 below provide literature reviews on ALDs, CI telephone

    comprehension, and bandwidth extension respectively.

    2.1 Assistive Listening Devices

    Assistive listening devices (ALD) aim at improving the quality of life of hearing-impaired

    people. With the help of hearing aids or cochlear implants, hearing-impaired people are

    usually able to have a person-to-person communication in a quiet environment. However,

    when there is ambient noise or interference, the hearing-impaired people suffer much more

    degradation than normal-hearing people. Therefore, they often need ALDs designed to pick

    up audio signals from the desired source and minimize the undesired interferences.

    To ensure the rights of hearing-impaired people, auxiliary services are required

    according to the Americans with Disabilities Act (Public Law 336 of the 101st Congress),

    which was enacted on July 26, 1990. Public services, operated by government or private

    entities, must provide hearing-impaired people the service that is functionally equivalent to

    that of normal-hearing people [29]. The auxiliary services include "qualified interpreters,

    assistive listening devices, notetakers, and written materials".

    Figure 2.1. Three components of an ALD.

An assistive listening device is usually composed of three parts: a sound-pickup

    component, a sound-generating component, and a transfer component that connects the

    previous two. The sound-pickup component, most commonly a microphone, picks up audio

    signal from a person, a TV, a stereo, or a telephone. The signal is routed to the ALD user by

    hardwire or wireless technology, then is processed, amplified, and sent to the hearing aid or

    the processor of the cochlear implant user.

    Different applications have different needs and require different designs. No single

    solution is optimal for all scenarios. Based on the method of sending the audio signal from

    the sound-pickup component to the sound-generating component, assistive listening devices

    fall into two categories: hardwired devices and wireless devices. The wireless devices can be

    further classified into three categories: induction loop, frequency modulated (FM), and

    infrared light, named after the transmission technologies used [3]. The different types of

    ALDs are discussed in the following sections. More detail is provided for the wireless

    devices.

    2.1.1 Hardwired Devices

    The obvious advantage of hardwired ALDs is that the transfer of sound by a cord is free of

electronic interference. However, for the same reason, they have the disadvantage of limited mobility. For a personal ALD, the user is confined to within a few meters of the sound source;

    for a large assistive listening system installed in an auditorium, users are restricted to specific

    seats.

    A typical example of hardwired ALDs is a currently available phone adapter for

    hearing-impaired people. The adapter plugs in between the phone-base and the phone-

    handset, takes the speech signal out, and provides an audio output jack that can be connected

    to the hearing aid or the CI processor. The user can listen through the adapter while still

talking into the phone-handset [32]. It avoids the sound degradation caused by the phone-

    handset speaker and the environmental noise, and therefore provides the user better

    conversation quality. This adapter may not work with all phones; since it is hardwired, the

    cord length confines the user.

Another recently proposed ALD for CI users can also be classified as a hardwired device.

    An in-the-ear microphone is connected to the CI input. The user only needs to hold the

    phone-handset, as normal, and the in-the-ear microphone picks up the sound [34]. It is small

and convenient, compatible with all phones, and the environmental noise is partially

    blocked as the phone-handset itself can serve as a seal.

    2.1.2 Induction Loop Devices

    In induction loop ALDs, audio signal received from the desired source is amplified and then

    sent to a wire loop that surrounds the room. The alternating current, carrying the signal,

    generates an alternating magnetic field inside the room. A coil on the user side picks up this

    magnetic field, and an inductive current is generated inside the coil, carrying the desired

    audio signal. The coil can be the input to the hearing aids or CI processors.

    The advantage of induction loop ALDs is their simple installation. For hearing aids

with a telecoil switch, the ALD requires nothing from the user, who simply walks into the room and switches to the "telecoil" option to listen through the ALD. However,

    induction loop devices are vulnerable to electromagnetic interference. Electrical installations

    and wires in the room, or another induction loop ALD nearby, are all possible sources of

    interference. For the above reasons, induction loop ALDs are typically used in large public

    facilities, such as classrooms and auditoriums.

    2.1.3 Frequency Modulation (FM) Devices

    FM ALDs use frequency modulation technology as the transmission method. The frequency

    variation around the carrier frequency represents the audio information. The user uses a

receiver to demodulate the radio frequency signal and retrieve the audio signal, which is then

    sent to a hearing aid or a CI processor.

    The advantages of FM ALDs are their portability, large coverage, and the ability to

    broadcast several signals in different channels at the same time. On the other hand, they are

    more complicated and expensive than induction loop ALDs; they are subject to interference

from radio signals, which may come from radio broadcasts or another FM ALD nearby; there is

    also a lack of privacy. FM ALDs are widely used as both personal ALDs and large assistive

    listening systems.

    2.1.4 Infrared Light Devices

    Infrared light technology is similar to FM except that the signal carrier, infrared light, is

    directional and it cannot penetrate opaque objects (such as walls).

This property brings the obvious advantage of privacy, as the signal is confined to the room. Also, because an infrared light ALD does not suffer interference from adjacent rooms and is resistant to radio interference, it provides higher audio quality than

    FM devices. However, infrared light ALDs are the most complicated and expensive of the

three kinds; their high power consumption usually cannot be supported by batteries, so they cannot be portable; and the receiver has to avoid sunlight, as the infrared component of sunlight can severely interfere with the desired signal. Due to these properties of infrared light,

    these ALDs are mostly used for home applications.

    In Chapter 3, a Bluetooth-based phone adapter is proposed. It belongs to the group of

FM ALDs. By taking advantage of the new Bluetooth technology, it overcomes the

    pitfalls of traditional FM ALDs and offers better sound quality and resistance to

    environmental noise.

    2.2 Telephone Recognition by CI Listeners

An important indicator of the quality of life of a CI user is whether he is able to carry on a conversation over the telephone, in the absence of lip-reading cues. Telephone usage is part

    of many CI rehabilitation programs [31] [2]. According to the survey result in [8], 51% of

    Ineraid CI implantees initiate calls and 66% of them answer calls in daily life. In another

    recent questionnaire result reported in [28], 51 out of 61 Finnish respondents use telephones.

    However, telephone competence shown in these studies was mostly limited to familiar callers

    and familiar topics.

    Different tests have been designed to evaluate the telephone ability of CI users. In [4]

    (1985), one of the earliest studies on this topic, one CI implantee with high performance was

    chosen to listen to Central Institute for the Deaf (CID) sentences over telephone and to repeat

them. She identified 21% of the keywords correctly, and 47% when listening twice. In a more

    systematic study reported in [7] (1989), subjects were tested with sentences sent through an

    extension-telephone call, a local call and a long-distance call. The results showed that 23% of

    their patients had a significant degree of telephone ability, and that a 50% or higher score in

the CID sentence test was a good indicator of telephone competence. The second conclusion

    was also confirmed by [8]. Another more recent study, [17] (1998), reported that 68% of the

    adult Clarion CI users were able to understand at least half of the sentences over the phone,

    and half were able to understand at least 75% of the sentences, 12 months post implantation.

Tests designed for prelingually deaf children with CIs are different from the above tests

    for postlingually deaf adults, as these children have no telephone experience and have a

    limited vocabulary. In [2], six children were tested with monosyllables, 2-syllable words and

    3-syllable words presented through telephone. The average percentage of correct responses

    ranged from 50% to 83% for different materials, and some of the children began to use

    telephones after this training program. A larger-scale study was reported in [31], which tested

    150 prelingually deaf children ranging from 1 year to 5 years after implantation. This

    hierarchical test started from recognizing rings and went up to carrying open conversation

    with unfamiliar callers. The performance of the children increased significantly over time

    and approached the level of normal-hearing children after 5 years.

    Although the above results are encouraging, most CI users are not able to have an

    interactive conversation with unfamiliar callers about unfamiliar topics, and they describe the

    telephone speech quality as weak, hollow, tinny, having echo, fuzzy, or distorted in other

    ways [8]. A detailed survey about telephone problems was done in [16]. It collected

information mainly from hearing-aid users, but the results were also applicable to CI users.

    Background noise was a problem for 94% of the respondents; 76% of them thought

    telephone speech was too soft; 66% of them reported lack of clarity, and this could not be

    solved by amplification. 70% of the subjects found coupling a hearing aid with a telephone to

    be problematic due to feedback effects, and nearly half of them preferred not to use their

    hearing aids with telephones. The respondents also showed a strong desire for improvements

    of ALDs.

In [28], the compatibility between CIs and cellular phones was explored. Digital

    phones generate a broad-spectrum radio signal, which appears to CI processors as noise. The

test results showed that Nucleus CI systems are not compatible with GSM phones, while

    Combi 40+ systems are compatible with the GSM phones tested.

    There are currently several phone adapters and ALDs for CI users. In [19], three

    commercial FM ALD products were evaluated with CI users in a noisy environment. All

    subjects demonstrated much higher recognition performance with the help of FM ALDs.

    Two widely used phone adapters are TEL-001 (Williams Sound) and TLP-102 (DynaMetric).

    Both of them are hardwired ALDs. They plug into the handset jack of a normal telephone,

    and provide a direct access to the telephone audio in the form of a mono-plug, which can be

fed into a CI processor. Thus environmental noise is excluded, and the feedback problem

    is avoided. A detailed description of TEL-001, as well as an in-the-ear microphone solution

    proposed in [34], can be found in Section 2.1.1.

    Speech intelligibility can also be improved for hearing-impaired people by using

    signal-processing techniques. Special strategies can be designed to compensate for their

    hearing loss. In [33], a frequency shaping method was proposed. It amplified the audio signal

    frequency-selectively based on the knowledge of hearing loss at different frequencies. When

    evaluated in an intelligibility test, the algorithm combining frequency shaping and frequency-

    selective amplitude compression achieved the most speech enhancement: 15-30% increase in

    recognition.

    Different from the above user-dependent signal processing strategies, bandwidth

    extension algorithms improve the general intelligibility by recovering lost high-frequency

    information. An overview of bandwidth extension algorithms is given next.

    2.3 Speech Enhancement by Bandwidth Extension

    The telephone speech signal in current telecommunication networks is band-limited to 300-

    3400Hz, while the speech bandwidth spans the range of 50Hz to 8000Hz. Figure 2.2 shows

    the spectrograms of the narrow-band signal and the wide-band signal. The loss of

information in the [50, 300] Hz and [3400, 8000] Hz ranges causes a muffled effect. For normal-

    hearing people, the narrow-band telephone signal is already good enough for intelligibility,

    and they prefer a wide-band signal only because it sounds more natural. For the hearing

    impaired, the loss of high-frequency consonants is one of the main reasons for the difficulty

    in using telephones.

    Figure 2.2. Spectrograms of the narrow-band (top) and the wide-band (bottom) speech.

Due to the redundant nature of human speech, the lost information can be

    recovered, at least partially, from the narrow-band signal. Algorithms, such as [6] [9] [14],

    have been proposed to solve this problem. Such algorithms can be implemented at the user

    end, and therefore require no change for the telephone network. Also, in [35], a coding

    method is proposed to recover wide-band speech accurately at the expense of additional low-

    bitrate transmission of side-information.

    2.3.1 Fundamentals of Bandwidth Extension

The low-band part of speech, 50-300Hz, contributes mainly to speech quality and little to

    intelligibility. In [35], the low-band signal is represented by two sinusoids. In [21], this is

    done by spectral envelope extension and inserting sinusoids into the residual. The

    performance of both methods depends on the accuracy of pitch detection, which is sometimes

    unreliable.

    In this thesis, we mainly focus on recovering the high-frequency band, i.e. 3400-

    8000Hz, of telephone speech. The typical architecture of a bandwidth extension system is

shown in Figure 2.3.

    Figure 2.3. Architecture of bandwidth extension systems.

    The whole algorithm can be viewed as two separate processes: the residual extension

and the spectral envelope extension. An LPC (Linear Predictive Coding) analyzer extracts

    the spectral envelope from the input narrow-band signal. The residual extension module

    processes the resulting residual signal, while the envelope extension module predicts the

    wide-band spectral envelope, based on the 300-3400Hz portion. The desired signal is then

    synthesized by using the wide-band residual and the wide-band LPC coefficients.

    Residual extension is discussed in Section 2.3.2. Two main methods of envelope

    extension, codebook mapping and linear estimation, are discussed in Section 2.3.3 and 2.3.4,

    respectively.

    2.3.2 Residual Extension

    A short frame of speech signal can be modeled as an autoregressive (AR) random process.

    The residual signal is the linear prediction error sequence, defined by the following equation:

$$e(n) = s(n) - \sum_{k=1}^{M} a_k\, s(n-k) \qquad (2.1)$$

where $a_1, a_2, \ldots, a_M$ are the LPC coefficients, $e(n)$ is the residual signal, and $s(n)$ is the speech signal.
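As a minimal illustration of Equation 2.1 (a sketch only, assuming numpy and scipy are available and that the LPC coefficients for the frame have already been estimated), the residual is obtained by filtering the speech with the FIR prediction-error filter:

    import numpy as np
    from scipy.signal import lfilter

    def lpc_residual(s: np.ndarray, a: np.ndarray) -> np.ndarray:
        """Prediction error of Eq. 2.1: e(n) = s(n) - sum_k a_k s(n-k).

        s: one frame of speech samples.
        a: LPC coefficients [a_1, ..., a_M], assumed precomputed.
        """
        # The prediction-error filter is FIR with taps [1, -a_1, ..., -a_M].
        return lfilter(np.concatenate(([1.0], -a)), [1.0], s)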

    The residual signal has a flat spectrum like white noise. In a voiced frame, such as

    vowels and semi-vowel consonants, the residual noise has periodicity. This appears as

harmonic peaks in addition to the flat noise-like spectrum. These peaks occur at multiples of the pitch, the fundamental voice frequency of the speaker.

    Therefore, the task of the residual extension module is to double the sampling rate,

    from 8kHz to 16kHz, while keeping the whole spectrum flat. If there are harmonics in the

    narrow-band residual, the wide-band residual should also have the harmonic structure. There

    are two methods in common use that accomplish that:

    1. Nonlinear distortion method. As is shown in Figure 2.4, the narrow-band residual is

    first upsampled by interpolation and then fed into a nonlinear function. The distorted

    signal will have the desired bandwidth and harmonic structure over the whole

    spectrum. After the whitening filter, the spectrum is flattened and the wide-band

    residual is achieved. A popular nonlinear function is given below:

$$y(t) = \frac{(1-\alpha)\, x(t) + (1+\alpha)\, \lvert x(t) \rvert}{2} \qquad (2.2)$$

where $x(t)$ is the input signal, $y(t)$ is the distorted output signal, and $\alpha$ is a parameter between 0 and 1 [20]. When $\alpha = 1$, it becomes the absolute-value function, which is used in [35] and achieves good results.

    Figure 2.4. Residual extension by nonlinear distortion.

    2. Spectrum folding method. This time-domain method proposed in [20] is easy to

    implement. The upsampling of the narrow-band residual is done by inserting zeros

    instead of interpolating. This is equivalent to folding the spectrum of 0-4000Hz to

    4000-8000Hz in the frequency domain. Since the low-frequency spectrum is flat and

    has harmonics, the resulting wide-band residual will also have a flat spectrum and

    harmonics in both the low-frequency part and the high-frequency part. One drawback

    of this method is that the harmonic structure is broken at 4kHz. A possible solution is

    to change the sampling rate to a multiple of the pitch before performing the folding,

    but this requires accurate pitch detection. Another disadvantage is that harmonics in a

    real wide-band residual should have descending amplitudes, but in the folding

method, the harmonics at the highest frequencies are reflections of the harmonics at the lowest frequencies and therefore have the same amplitudes. Fortunately, these details of the residual signal do not affect speech intelligibility substantially, and the

    spectrum folding method is widely used.
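Both operations at the core of these two methods are simple enough to sketch in a few lines; the following numpy fragment (an illustration only, with the interpolation, whitening, and framing stages omitted, and with Equation 2.2 in the form reconstructed above) shows zero-insertion folding and the parametric nonlinearity:

    import numpy as np

    def fold_residual(e_nb: np.ndarray) -> np.ndarray:
        """Spectrum folding: upsample 8 kHz -> 16 kHz by inserting zeros.

        Zero insertion mirrors the 0-4 kHz residual spectrum into the
        4-8 kHz band, preserving its flatness and harmonic structure.
        """
        e_wb = np.zeros(2 * len(e_nb))
        e_wb[::2] = e_nb          # keep original samples, zeros in between
        return e_wb

    def distort(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
        """Nonlinear distortion of Eq. 2.2; alpha = 1 gives abs(x).

        Applied to an interpolated residual; the output still needs a
        whitening filter to flatten its spectrum.
        """
        return ((1.0 - alpha) * x + (1.0 + alpha) * np.abs(x)) / 2.0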

    2.3.3 Codebook Method

    Codebook mapping is a popular method to achieve spectral envelope extension [5][9][14].

    For this application, the codebook, as is shown in Figure 2.5, consists of two columns. The

    first column contains vectors composed of spectral parameters extracted from the narrow-

    band signal, while the second column contains vectors extracted from the corresponding

    wide-band signal. When an input frame comes in, the parameters are extracted from it and

    compared with the vectors in the first column. By vector quantization, the vector closest to

    the input parameters is found, and the corresponding wide-band parameters are taken from

    the second column to generate the extended spectral envelope.

    The codebook is generated from a large training database of speech. Eligible

    parameters to be used in the codebook are LPC coefficients, reflection coefficients, LSF,

    cepstral coefficients, etc.
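The per-frame lookup described above is a nearest-neighbor search over the first codebook column. A minimal numpy sketch, assuming the paired codebook columns have already been trained, is:

    import numpy as np

    def extend_envelope(x_nb: np.ndarray, cb_nb: np.ndarray,
                        cb_wb: np.ndarray) -> np.ndarray:
        """Codebook mapping for one frame.

        x_nb:  spectral-parameter vector of the narrow-band input frame.
        cb_nb: K x N_nb matrix, the narrow-band column of the codebook.
        cb_wb: K x N_wb matrix, the paired wide-band column.
        """
        d = np.sum((cb_nb - x_nb) ** 2, axis=1)  # squared distances to codewords
        return cb_wb[np.argmin(d)]               # wide-band entry of closest match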

    Figure 2.5. Envelope extension by codebook mapping.

An obvious limitation of the codebook mapping method is that the number of possible outputs is fixed by the codebook size. Also, when the parameters of two narrow-band

    frames belong to the same group in vector quantization, the corresponding wide-band frames

    do not necessarily belong to the same group. The probability of such mismatches increases

when the size of the codebook increases [21].

    To address the above problems, improved versions of codebook mapping were

    proposed:

    1. Codebook plus interpolation. When a set of input parameters comes in, a

    number of closest codebook items are found. The output is computed as the

    weighted sum of the corresponding wide-band parameter-vectors, based on a

    certain statistical model [5].

    2. Multiple codebooks. Speech frames are classified into several groups, and one

    codebook is trained and used separately for each group. In [9], two codebooks

    were trained and used for voiced and unvoiced frames, and the performance was

    found to be superior to other codebook methods.

    3. Statistical codebook searching. In order to reduce mismatching, when making

the decision on an upcoming frame, the information from a number of previous frames is taken into consideration to find the codebook item with the highest

    probability. In [14], a codebook search method was proposed, based on hidden

    Markov models (HMM).

    2.3.4 Linear Estimation Method

Instead of codebook mapping, spectral envelope extension can also be done by linear estimation. The vector $\vec{x}$, the set of parameters representing the narrow-band spectral envelope, is first extracted from an input signal frame. Then the corresponding vector $\vec{y}$ representing the wide-band envelope is calculated by feeding $\vec{x}$ into a group of linear filters. This is shown in the following equation:

$$\vec{y} = M \vec{x} \qquad (2.3)$$

where $M$ is the matrix composed of filter parameters. The output spectral envelope is then generated from the vector $\vec{y}$.

If we look at each element of $\vec{y}$ and the corresponding row of $M$, Equation 2.3 can be viewed as $N$ separate equations, as follows:

$$y(k) = \vec{w}(k)^{T} \vec{x}, \qquad k = 1, 2, \ldots, N \qquad (2.4)$$

where $N$ is the dimension of $\vec{y}$ and $\vec{x}$; $y(k)$ is the $k$th element of $\vec{y}$; $\vec{w}(k)$ is the $k$th row of $M$. To generate $M$ from a training database, a large number of $(\vec{x}, \vec{y})$ pairs are extracted, and the optimal $\vec{w}(k)$ is found by minimizing the following cost function:

$$E(k) = \sum_{i=1}^{L} \bigl( y(k,i) - \vec{w}(k)^{T} \vec{x}(i) \bigr)^{2}, \qquad k = 1, 2, \ldots, N \qquad (2.5)$$

where $L$ is the size of the training data; $\vec{x}(i)$ is the input vector $\vec{x}$ in the $i$th pair of training data; $y(k,i)$ is the $k$th element of the output vector $\vec{y}$ in the $i$th pair of training data.

From Wiener filtering theory, the optimal tap-weight vector is given below:

$$\vec{w}_{\mathrm{opt}}(k) = \left( X^{T} X \right)^{-1} X^{T} \vec{Y}(k), \qquad k = 1, 2, \ldots, N \qquad (2.6)$$

where $X$ and $Y$ are composed of the training data, with each row consisting of one sample of $\vec{x}$ or $\vec{y}$, respectively; $\vec{Y}(k)$ is the $k$th column of matrix $Y$.

Combining the $N$ equations in Equation (2.6), the final training solution is given below:

$$M = \left( X^{T} X \right)^{-1} X^{T} Y \qquad (2.7)$$
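In practice, the solution of Equation 2.7 can be computed with a standard least-squares routine rather than by forming the inverse explicitly. The sketch below assumes numpy and uses the row-vector convention of the training matrices; the data are random stand-ins for illustration only:

    import numpy as np

    def train_estimator(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Solve Eq. 2.7, M = (X^T X)^{-1} X^T Y, by least squares.

        X: L x N matrix whose rows are narrow-band parameter vectors.
        Y: L x N matrix whose rows are the paired wide-band vectors.
        """
        # lstsq minimizes ||X M - Y||^2 without an explicit matrix inverse.
        M, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return M

    # Illustration with random stand-in data (not real speech parameters).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 8))
    Y = rng.standard_normal((1000, 8))
    M = train_estimator(X, Y)
    y_hat = X[0] @ M    # wide-band estimate for one frame (row convention)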

The advantage of the linear estimation method is that it requires much less memory and computation than the codebook mapping method, which is a desirable property when implementing the system. One disadvantage is that the solution may yield invalid values representing an LPC filter with an unstable impulse response. Therefore, special adjustments have to be added to avoid invalid results, and this might introduce artifacts. Also, since a linear model is used to describe the nonlinear relation between narrow-band and wide-band parameters, a certain degree of distortion can be expected.

    To compensate for the nonlinearity, speech classification techniques are combined

    with linear estimation, as proposed in [23] and [6]. This is also known as piecewise linear

estimation. Speech frames are classified into several groups, and estimation matrices are trained and used for each group separately.

    The key factor in the linear estimation method is the choice of parameters to be used.

    Various speech parameters, such as LPC coefficients, LSF, and reflection coefficients, can

    represent the spectral envelope. In [9], a linear estimation using sub-band log energies is

    compared with a codebook mapping method, and shows higher spectral distortion, even with

    a classification into eight groups.

    LSFs are good candidates. Several linear estimation algorithms using LSFs have been

    proposed. In [21], the envelope extension by estimating LSFs was compared with a

    codebook-mapping counterpart using both an objective distortion measure and a subjective

evaluation, and showed better performance. In [6], the number of LSFs, i.e., the order of the LPC analysis, is doubled for the wide-band signal. The lower half of the expanded-signal LSFs is

    calculated by dividing the narrow-band LSF values by 2. This is equivalent to copying the

    narrow-band spectrum to the low-band of the output signal. Thus the transparency of the

system in the 300-3400Hz band is guaranteed.

    CHAPTER THREE

    BLUETOOTH-BASED PHONE ADAPTER

As discussed in Chapter 2, one of the main difficulties hearing-impaired people face in using telephones is the noise introduced by the interaction between the phone and the

    hearing aid. Thus phone adapters are needed to route the audio signal directly to the hearing

    aid or CI processor.

    This chapter proposes a wireless phone adapter, which falls into the category of FM

    assistive listening devices. This adapter is based on Bluetooth technology, which is a new

    wireless transmission standard. The favorable features of this technology make the adapter

    superior to traditional ALDs.

    3.1 Introduction to Bluetooth

    Wireless technology has dramatically changed the way people interact with one another and

    receive information. Bluetooth, a short-distance wireless communication standard, aims at

    replacing cables and therefore making the world truly wireless. It defines a universal radio

    interface, through which devices within 10 meters can form short-distance ad hoc networks.

    By using the dynamic Bluetooth links between these mobile devices, a large number of new

    products and services will become possible.

    The physical carrier of Bluetooth connections is the 2.4-2.5 GHz band, which is

    available for public use in most countries. This band is divided into 79 1-MHz-width

channels, and each channel is divided into 625-µs time slots. The modulation scheme

    used for one time slot of one channel is based on binary frequency shift keying (BFSK). A

    Bluetooth link, with one side called "the master device" and the other side called "the slave

    device", uses one channel in each time slot and jumps to another channel in the next time

    slot. The sequence of channels to be used is a pseudo-random sequence decided by the

    master device. The above scheme is called Frequency Hopping Code Division Multiple

    Access (FHCDMA). The device that initiates the ad hoc network becomes the master, while

    the other devices in the network are slaves. Two sides of a link alternately transmit and

    receive, i.e., there is only one-way traffic in one time slot. There are two kinds of links:

    synchronous connection-oriented (SCO) links and asynchronous connectionless (ACL) links.

    An SCO link is composed of evenly spaced pairs of time slots in the hopping sequence, with

    a 64Kb/s bit-rate; ACL links use the slots not reserved by SCO links. One Bluetooth link can

    support ACL links and up to 3 SCO links at the same time, and its theoretical maximum bit-

    rate is 1Mb/s [12][29].
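As a toy illustration of the hopping idea only (the actual selection algorithm in the Bluetooth baseband specification derives the sequence from the master's clock and device address, not from a generic random generator):

    import random

    # Master and slave share a pseudo-random sequence over the 79 channels,
    # advancing to a new channel every 625-us time slot.
    master_seed = 0x2A   # stand-in for the master's clock/address
    hop = random.Random(master_seed)
    sequence = [hop.randrange(79) for _ in range(10)]  # first 10 slots
    print(sequence)      # both sides compute the same channel list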

    To ensure interoperability between Bluetooth devices, protocol layers and application

    profiles are defined in the Bluetooth Specification. Figure 3.1 illustrates the structure of a

    Bluetooth protocol stack. The layers above and including HCI (Host Controller Interface) can

    be viewed as software layers, while the rest of the stack can be viewed as hardware layers.

    Baseband and Link Manager layers implement the transport actions described in the previous

    paragraph, and provide a command interface, the HCI layer. The L2CAP (Logical Link

    Control and Adaptation Protocol) layer divides large packets from higher layers into small

    packets for lower-layer transmission and reassembles received small packets into large

    packets intelligible to higher layers; the L2CAP layer also supports multiple applications by

    assigning logical channels. Thus, TCS (Telephony Control Specification), SDP (Service

    Discovery Protocol) and RFCOMM are unaware of physical communication details. The

    RFCOMM layer further emulates a serial port, so that many conventional applications can be

    used on it with no or minor changes. Application profiles for different scenarios are also

    included in the Bluetooth Specification to guide implementation, as two devices following

    the same profile have guaranteed compatibility [22].

    Figure 3.1. Structure of a Bluetooth stack.

TCS=Telephony Control Specification

SDP=Service Discovery Protocol

    L2CAP=Logical Link Control and Adaptation Protocol

    HCI=Host Controller Interface

    Currently available products equipped with Bluetooth technology include cellphones,

    phone adapters, headsets, PC cards, modems, printers, and printer adapters, which are mainly

    used to replace cables. A more important potential of Bluetooth is auto-synchronization.

    Personal mobile devices, such as cellphones, PDAs and laptops, can form a mobile network

    and always keep updated with each other. Wherever the user goes, the personal devices can

    automatically find the local Bluetooth-enabled devices and make use of the information and

services provided. It is projected that by 2005 there will be millions of Bluetooth-enabled products, and that the auto-synchronization function will bring great benefit to

    users. Places such as stores, cinemas and airports will only need a Bluetooth-enabled

    information access point to do all their business.

    The key factor for the success of this technology is interoperability. Manufacturers

    have to make their products comply with the Bluetooth Specification so that they can

    communicate with other products. A minimum requirement is to pass the qualification

    program administered by the Bluetooth Special Interest Group (SIG).

    3.2 Phone Adapter Design

    The phone adapter proposed in this thesis is an application of Bluetooth wireless technology

for hearing-impaired people. The Bluetooth link is used to transmit the audio signal. The

    architecture of the proposed adapter is illustrated in Figure 3.2.

    A pair of Bluetooth transceivers forms the wireless link, and each of them is

    connected to a host controller running the software protocol stack. The host controller can be

    a PC, a microcontroller, or a digital signal processor (DSP). The telephone signal is

    connected to the duplex audio interface of the master device. The audio output of the slave

    device is connected to the hearing aid or the CI processor, while the input is connected to a

    lapel microphone.

    Figure 3.2. Architecture of the phone adapter.

    The slave device is first initialized to active slave mode, waiting for the connection

    request from the master. When a telephone call comes in, or the user makes a call, the master

device sends out paging messages to find the slave device, and initiates an SCO link. After the

    connection is confirmed by both sides, the user can talk through this Bluetooth link without

    the need to hold the telephone handset or the need to connect the hearing aid or CI processors

    directly to the telephone jack. Since the audio signal is directly transmitted from the phone to

the hearing aid or CI processor, environmental noise is excluded, and the user will be able to enjoy high speech quality even in extremely noisy situations (e.g., in a crowded restaurant,

    in a car).

    3.3 Hardware Design

    A prototype system was developed in this thesis. In this prototype, a pair of Ericsson's

    Bluetooth Starter Kits (EBSK) was used for the transceiver hardware. The PC works as the

    host controller of the master device, which is stationary and connected to the telephone. A

    Motorola DSP56309 processor is used as the host controller of the slave device and provides

    portability of the user side.

    Figure 3.3. Hardware design.

    Figure 3.3 shows the block diagram of the hardware design of the slave device. The

    host I/O port of the DSP56309 Evaluation-Module (EVM) board is programmed to send HCI

    commands to EBSK. A sequence of 9 HCI protocol commands is implemented in this device

    in assembly language. Their function is to reset the EBSK, to set basic transmission settings,

    and to put the EBSK in an active slave mode. In order for this prototype to be stand-alone,

    the assembly program is written into the flash memory on the EVM board and is executed

    when the DSP is reset. To meet the electrical requirement of the host interface of the EBSK,

a signal amplifying and shifting circuit was designed based on an LM318N chip (National

Semiconductor), and a voltage converter circuit was built upon an LMC7660IN chip (National

    Semiconductor) to provide a negative voltage for the amplifier. Figure 3.4 shows a

    photograph of the portable slave-side device.
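For illustration, two HCI commands of the kind used in the initialization sequence described above are shown below as raw UART transport packets (packet type 0x01, 16-bit opcode in little-endian order, parameter length, then parameters); the exact nine-command sequence programmed into the DSP is not reproduced here:

    # HCI_Reset (opcode 0x0C03): restore the module's default state.
    HCI_RESET = bytes([0x01, 0x03, 0x0C, 0x00])

    # HCI_Write_Scan_Enable (opcode 0x0C1A) with parameter 0x03
    # (inquiry scan + page scan): makes the module discoverable and
    # connectable, i.e. able to wait as a slave for a master's page.
    HCI_WRITE_SCAN_ENABLE = bytes([0x01, 0x1A, 0x0C, 0x01, 0x03])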

    Figure 3.4. Phone adapter prototype.

    3.4 Software Design for Wireless Link

The ultimate goal of this project is not only a phone adapter, but an ALD that can

    receive audio signals from all Bluetooth-enabled sources, such as TVs, stereos, and

    computers. An audio source with a Bluetooth transceiver should be able to find this device

with a function description and set up an SCO link, all through the procedures defined in the

    Bluetooth Specification. To achieve this interoperability, the host controller of our device

    needs to support the L2CAP protocol, the RFCOMM protocol, the SDP protocol, and one

    application profile defined in the specification.

    The Headset Profile is most similar in function to our phone adapter. It defines the

    procedure of setting up a Bluetooth audio link between an audio-gateway device and a

    headset device. In our phone adapter, the telephone-side device corresponds to the audio

    gateway, and the user-side device corresponds to the headset. The transceiver hardware is

    still a pair of EBSK, but the host controllers of both sides are two computers with Ericsson

    Bluetooth PC Reference Stack loaded. This software stack is a COM-server (Component

    Object Model) in the form of an executable file. It contains HCI, L2CAP, RFCOMM, and

    SDP layers, and provides a programming interface [10]. Application programs, written in

    C++, communicate with the protocol layers by sending commands and receiving event-

    messages. These programs emulate the operations of the audio gateway and the headset, as

    defined in the Bluetooth Specification.

    Figure 3.5. Emulating the Headset Profile.

    The structure of the prototype system is shown in Figure 3.5. For practical reasons,

    the user-side device should be portable and cannot be hosted by a computer. Here we make

    the assumption that the software portion of this prototype can be implemented using a

    microcontroller or a DSP.

    Figure 3.6. Message flow (adapted from [10][29]).

    Figure 3.6 shows the message flow of setting up a headset link. The application

    program first initializes itself by registering at the protocol layers, and starts by writing the

SDP service record as a headset. When the remote audio gateway inquires about the function

    description, SDP answers with the information written by the program. After answering the

    L2CAP and RFCOMM connection requests properly, a virtual serial-port link is established

    between the two devices. "RING", an AT command defined in [13], is sent from the audio

    gateway to the headset, and the program answers with "AT+CKPD=200", another AT

command [11], which indicates that the incoming call is accepted. Then the audio gateway initiates the SCO link that carries the speech signal, and the setup is complete.
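On the headset side this exchange reduces to a line-oriented dialog on the emulated serial port. A sketch using pyserial is given below; the port name is hypothetical, and a real application must also handle the SDP, L2CAP, and RFCOMM steps described above:

    import serial  # pyserial; the RFCOMM link appears as a local serial port

    link = serial.Serial("COM5", 9600, timeout=30)  # port name is hypothetical
    line = link.readline().decode(errors="ignore").strip()
    if line == "RING":                # incoming-call alert from the gateway
        link.write(b"AT+CKPD=200\r")  # emulated button press: accept the call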

    The application programs were developed based on sample programs provided with

    the Ericsson stack. Those programs were modified for our application. When both host

    controllers used our emulating programs, the EBSKs completed the whole process shown in

    Figure 3.6.

    3.5 Testing

    In order to evaluate the effectiveness of the wireless transmission on the quality of audio

signal, three CI users were invited to talk through the phone-adapter prototype. All users were fitted with the MED-EL CIS-LINK processor. The three CI subjects used their daily

    MED-EL processor with the CIS strategy running at 1000-2000 pulses/second. The audio I/O

of the portable device was split into two mono-jacks, one leading to the audio input of the CI

    processor and the other leading to a microphone. The user listened through his CI processor

and talked into the microphone. The stationary side of the prototype was connected to different

    audio sources, including a handset with a person talking, a sound-card jack of a computer,

    and audio wires taken from a normal telephone.

    Good quality was reported by the CI users with a person talking through the handset,

    when both sides were within reasonable distance inside the lab (the lab is 7 meters long and 6

    meters wide). When the user, holding the portable device, walked outside the room and

closed the door (which has a metal frame), the signal faded substantially.

    In order to verify the interoperability of the software design, the user-side device (the

    virtual headset) was tested with an Ericsson T28W cellphone coupled to a DBA10 adapter,

    which supports the audio-gateway function in the Headset Profile. When the headset program

    was tested with T28W+DBA10, the process hung at a minor step before the "RING"

    message. The last message from the remote device was a modem-status-change event on the

    virtual connection, and that message was not answered properly. This could be due to

    compatibility problems between current Bluetooth products, and further work on this adapter

    might need technical support from the manufacturer.

    CHAPTER FOUR

    BANDWIDTH EXTENSION OF TELEPHONE SPEECH

    In Chapter 2, we have seen that both codebook-mapping algorithms and linear-estimation

    algorithms have their advantages and disadvantages, and that linear estimation requires much

    less memory and computation. This chapter proposes a linear estimation method based on

    LSF parameters, combined with speech classification techniques to overcome drawbacks of

    generic linear estimation. Section 4.1 provides a discussion about the algorithm proposed in

    [6], which motivated the proposed method. Section 4.2 provides a detailed description of the

    improved algorithm. Section 4.3 evaluates this algorithm using several objective measures.

    4.1 Linear Estimation Method for Bandwidth Extension

    In this section, we continue the discussion in Section 2.3. By analyzing the advantages and

    disadvantages of the linear estimation method proposed in [6], we provide the theoretical

    basis of the proposed algorithm of this thesis. The operation of linear estimation is shown in

    Equation 2.3, which is reproduced here:

$$\vec{y} = M \vec{x} \qquad (4.1)$$

where the vector $\vec{x}$ is the set of parameters representing the narrow-band spectral envelope; the vector $\vec{y}$ is the set of parameters representing the wide-band spectral envelope; $M$ is the matrix composed of estimation parameters.

    The choice of parameters is the critical factor in the linear-estimation performance.

    Among the parameters that can describe the spectral envelope, LSF is a good candidate.

Initially proposed and proven in [27][30], it is a set of parameters equivalent to the LPC coefficients. Its definition starts from the following functions:

$A(z) = 1 + a_1 z^{-1} + \cdots + a_M z^{-M}$

$B(z) = z^{-(M+1)} + a_1 z^{-M} + \cdots + a_M z^{-1}$    (4.2)

where $a_1, a_2, \ldots, a_M$ are the LPC coefficients, and $A(z)$ is the transfer function of the linear-prediction error filter. Further, let us define the following two functions:

$P(z) = A(z) - B(z)$

$Q(z) = A(z) + B(z)$    (4.3)

    Given P(z) and Q(z), we can calculate A(z), and hence the LPC coefficients, as follows:

$A(z) = \frac{P(z) + Q(z)}{2}$    (4.4)

    It was proven in [27] that P(z) and Q(z) can be factorized as follows:

If $M$ is an even number:

$P(z) = \left(1 - z^{-1}\right) \prod_{i=2,4,\ldots,M} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

$Q(z) = \left(1 + z^{-1}\right) \prod_{i=1,3,\ldots,M-1} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

If $M$ is an odd number:

$P(z) = \left(1 - z^{-2}\right) \prod_{i=2,4,\ldots,M-1} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$

$Q(z) = \prod_{i=1,3,\ldots,M} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$    (4.5)

where $\omega_1 < \omega_2 < \cdots < \omega_M$ are defined as the line spectral frequencies (LSFs). The distribution of the LSF values describes the shape of the spectral envelope: a dense cluster of LSF values represents a high-magnitude

    portion of the spectrum; a scattered distribution represents the low magnitude portion; a close

    pair of LSF values represents a peak in the spectrum.
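To make this definition concrete, the LSFs can be computed by rooting P(z) and Q(z) and keeping the root angles on the upper unit circle. The following is a minimal sketch of this computation (assuming NumPy; the function name is illustrative, and the LPC vector is assumed to include the leading 1), not the exact implementation used in this thesis:

```python
import numpy as np

def lpc_to_lsf(a):
    """LSFs from LPC coefficients via Equations 4.2-4.5 (a sketch).
    a: LPC coefficients [1, a1, ..., aM] of A(z).  P(z) and Q(z) have
    roots on the unit circle; their angles in (0, pi) are the M LSFs."""
    b = np.concatenate(([0.0], a[::-1]))   # coefficients of B(z)
    a_pad = np.concatenate((a, [0.0]))     # A(z) padded to the same length
    p = a_pad - b                          # P(z) = A(z) - B(z)
    q = a_pad + b                          # Q(z) = A(z) + B(z)
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        angles.extend(w[(w > 1e-9) & (w < np.pi - 1e-9)])  # drop roots at 0, pi
    return np.sort(np.array(angles))
```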

LSF has some desirable properties: when the LSF values fall in the range $(0, \pi)$, the recovered LPC filter has guaranteed stability; local errors in LSF values cause only local

    spectral distortion. Therefore, linear estimation based on LSF values is more tolerant to

    estimation errors, as a single error cannot harm the whole spectral envelope. Linear-

    estimation algorithms using LSF were proposed in [6][21], and shown to yield superior

    performance to codebook methods.

    In [6], the high-band LSFs are estimated from low-band LSFs by the following

    equation:

$\mathbf{f}_h = \mathbf{f}_l A$    (4.6)

where $\mathbf{f}_h$ and $\mathbf{f}_l$ are $1 \times 8$ vectors of the high-band LSFs and the low-band LSFs respectively. The $8 \times 8$ matrix $A$ is calculated from training data by the following equation:

$A = \left(F_l^T F_l\right)^{-1} F_l^T F_h$    (4.7)

where the matrices $F_l$ and $F_h$ are obtained from training data. The rows of $F_l$ and $F_h$ consist of samples of $\mathbf{f}_l$ and $\mathbf{f}_h$ respectively. For each frame, the LSFs of the narrow-band signal are divided by 2 and then fed into Equation 4.6 as $\mathbf{f}_l$. The estimated $\mathbf{f}_h$ is combined with $\mathbf{f}_l$, and the whole set is used as wide-band LSFs to generate the wide-band spectral

    envelope. To improve the performance of linear estimation, speech frames are divided into 4

    groups based on the first two reflection coefficients, k1 and k2, as shown in the Table 4.1.

    Four matrices are trained and used separately for each group.

Speech class        Reflection coefficients
Class 1             k1 <= -0.7,  k2 <= 0.55
Class 2             k1 > -0.7,   k2 <= 0.55
Class 3             k1 <= -0.7,  k2 > 0.55
Class 4             k1 > -0.7,   k2 > 0.55

Table 4.1.--Classification of the [6] algorithm.

This algorithm makes the assumption that the frequency ranges $(0, \pi/2)$ and $(\pi/2, \pi)$ contain the same number of LSF values in wide-band speech. For example, when the LPC order is 24, there are always 12 LSFs in $(0, \pi/2)$ and 12 LSFs in $(\pi/2, \pi)$. This assumption is not true for all frames: in our training on the TIMIT database, the actual distribution was (11,13) or (13,11) with a probability of around 50%. In particular, the high-frequency consonants, which are of special interest to our problem, mainly have the distribution (11,13). This reflects the fact that their speech energy is concentrated in the high-band. To compensate for

this drawback, the proposed algorithm artificially disperses the LSF values in $\mathbf{f}_l$ of consonant frames before feeding it into the linear estimation. The classification component of the proposed algorithm provides a group of fricative-consonant frames, which is used as the criterion for applying this operation. The implementation can be found in Section 4.2.2.

Figure 4.1 shows the theoretical upper limit of the performance of the [6] algorithm, as the output signal is synthesized using the original wide-band spectral envelopes. The recovered consonants are weak due to the lack of energy in the narrow-band signal. When the original speech is transmitted over a telephone line, consonants lose most of their energy in the high-band. When the LPC analysis is done on a narrow-band consonant frame, this lack of energy is reflected in the residual signal. The residual of a consonant frame was found to be about 20dB lower in magnitude than that of a vowel frame. Therefore, an amplifying operation is

    needed for residual signals of consonant frames. The implementation can be found in Section

    4.2.1.

    Figure 4.1. Lack of energy in consonant frames. (a) Original wide-band sentence

    spectrogram. (b) Sentence synthesized by original envelopes and residual spectrum folding.

Another shortcoming is the classification by fixed thresholds. In real life, different speakers have different distributions on the k1-k2 plane and therefore require different thresholds. Ideal classification criteria should adapt to the speaker by making use of the information from other frames in the same sentence. In this thesis, we propose a classification strategy based on Hidden Markov Models (HMM). This statistical method makes the thresholds soft, and is therefore more robust when working on speakers with various voice characteristics. The derivation and implementation can be found in Section 4.2.3.

    Figure 4.2. Diagrammatic representation of the bandwidth-extension algorithm.

    4.2 Proposed Algorithm for Bandwidth Extension

    This section describes the implementation details of the proposed algorithm. The overall

    system flow is shown in Figure 4.2.

    The narrow-band speech signal, with an 8kHz sampling rate, is processed on a frame-

    by-frame basis using a 20-ms Hanning window and a 10-ms overlap between adjacent

    frames. The windowed speech frame is analyzed using an LPC analyzer of order 12. The

    output of the analyzer goes to three branches:

    1. Residual extension. The narrow-band residual signal passes through a spectrum

    folding function where the sample rate is doubled and the low-band spectrum is

    copied to the high-band. Additional amplification is then applied to consonant

    frames, and the resulting sequence is sent out as the wide-band residual.

    2. Envelope extension. Narrow-band LPC coefficients are converted into 12

    narrow-band LSF values. They are divided by 2, pre-processed and then fed into

    the linear estimation, as shown in Equation 4.6. After eliminating invalid

    numbers, the estimated 12 high-band LSF values, together with the low-band LSF

    values, are converted back into order-24 wide-band LPC coefficients.

3. Classification of speech frames. Parameters are extracted from each narrow-band

speech frame. Using this information, a classification decision is made based on an

HMM, and this decision is used in both of the previous branches.

    Finally, the wide-band residual passes through the LPC synthesizer (of order 24)

    constructed by the wide-band LPC coefficients. The output is the desired wide-band speech,

    sampled at the rate of 16kHz, with recovered high-band information.

    Sections 4.2.1, 4.2.2 and 4.2.3 provide detailed description of these three branches

    respectively.

    Figure 4.3. Spectrum folding.

    4.2.1 Residual Extension

    The residual extension module implements the spectrum folding method proposed in [20].

    The sample rate is doubled from 8kHz to 16kHz. The odd-index sample values of the wide-

    band residual signal are copied from the narrow-band residual, and the even-index sample

    values are zeroed. This process is shown in the following equation:

$y(2k-1) = x(k), \qquad y(2k) = 0, \qquad k = 1, 2, \ldots, N$    (4.8)

    where y(n) is the wide-band residual and x(n) is the narrow-band residual. This time-domain

operation is equivalent to folding the (0, 4000) Hz spectrum to (4000, 8000) Hz in the frequency domain, as illustrated in Figure 4.3.

    Since the narrow-band residual has a flat spectrum, the spectrum is still flat after the

    folding. For unvoiced frames, the narrow-band residual is a noise-like sequence, and the

    wide-band residual inherits the same property; for voiced frames, the harmonic peaks in the

    narrow-band spectrum are copied to the high-band. As has been discussed in Section 2.2.2,

the harmonic structure is disrupted at 4kHz, but this drawback does not cause a perceivable artifact in the output speech.
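In code, the folding of Equation 4.8 amounts to zero insertion. A minimal sketch, assuming NumPy and 0-based indexing:

```python
import numpy as np

def fold_residual(x):
    """Spectrum folding (Equation 4.8): up-sample the narrow-band residual
    by inserting a zero after every sample.  In 0-based terms y[2k] = x[k]
    and y[2k+1] = 0; the zero insertion mirrors the (0, 4kHz) spectrum of
    x into the (4kHz, 8kHz) band of y."""
    y = np.zeros(2 * len(x))
    y[::2] = x
    return y
```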

    As discussed in Section 4.1, spectrum folding is not enough for consonant frames, as

    their residual signal needs to be amplified to the energy level of a real wide-band consonant.

    In order for the algorithm to be adaptive to different speakers, we use the average energy of

    previous vowel residuals in the same sentence as the reference energy level. The

    vowel/consonant decisions are provided by the classification branch. The amplification

    process is shown in the following equation:

$e'(n) = e(n) \left( \frac{\frac{1}{N} \sum_{i=1}^{N} \sum_{n} v_i^2(n)}{\sum_{n} e^2(n)} \right)^{0.8}$    (4.9)

where $e(n)$ is the consonant residual sequence, $v_i(n)$ is a vowel residual sequence, and $N$ is

    the number of previous vowel frames. The parameter 0.8 is an empirical number, which

    represents the trade-off between the need to amplify the signal and to maintain the energy

variation of different consonants. It was validated by experimental results, and examples can be found in Section 4.3.3.
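A sketch of the amplification step is given below (assuming NumPy; the function name and the bookkeeping of vowel energies are illustrative, and the gain follows Equation 4.9 as written above):

```python
import numpy as np

def amplify_consonant(e, vowel_energies, alpha=0.8):
    """Consonant-residual amplification (Equation 4.9).
    e: wide-band residual of the current consonant frame;
    vowel_energies: energies of the previous vowel residuals in the same
    sentence, whose mean is the reference level; alpha is the empirical
    0.8 trade-off exponent."""
    ref = np.mean(vowel_energies)    # (1/N) * sum of vowel residual energies
    cur = np.sum(e ** 2)             # energy of the consonant residual
    if cur == 0.0:
        return e                     # silent frame: nothing to amplify
    return e * (ref / cur) ** alpha
```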

    Figure 4.4. Envelope extension. Markers indicate LSF values.

    4.2.2 Envelope Extension

    The envelope extension module implements a linear estimation method using LSF. The

estimation and training equations are the same as in the [6] algorithm, as shown in Equations 4.6

    and 4.7 respectively. Figure 4.4 shows an example of the spectral envelopes before and after

the extension, together with the distribution of LSF values. Note that $\pi$ on the x-axis of the upper figure corresponds to 4kHz, while on the x-axis of the lower figure it corresponds to 8kHz.

    Among the 24 LSF values describing the wide-band LPC spectral envelope, the first

    12 LSFs come from the narrow-band signal and determine the envelope in the low-band.

They are calculated by dividing the narrow-band LSF values by 2; therefore their relative locations remain unchanged and they still describe the same envelope. This fact can be seen in Figure 4.4, where the first half of the lower envelope is a compressed copy of the upper envelope. The second half of the 24 LSF values is generated by the linear estimation, and determines how accurately the high-band information can be estimated. Estimation errors cause spectral distortion in the output speech.

As discussed in Section 4.1, high-frequency consonant frames mainly have the LSF distribution (11,13). Therefore, when we train the estimation matrix for the consonant group using Equation 4.7, most rows of the matrix $F_l$ contain a value exceeding $\pi/2$. But the vector $\mathbf{f}_l$ we feed into the estimation is composed of values limited to $(0, \pi/2)$, and this mismatch between working data and training data substantially degrades the performance. The results include invalid output LSF values exceeding $\pi$ and severe spectral distortion at the highest frequencies.

The solution proposed in our algorithm is to artificially disperse the LSF values in $\mathbf{f}_l$ for consonant frames, according to the following equation:

If $f_l(12) < 1.5$:  $\mathbf{f}_l \leftarrow 1.0825\, \mathbf{f}_l$    (4.10)

where 1.5 and 1.0825 are empirical parameters. This operation makes the maximum value in $\mathbf{f}_l$ approach or exceed $\pi/2$, emulating the situation in a real wide-band frame. As a by-

product, some distortion is introduced to the low-band: the magnitude is decreased and the spectral peaks are moved towards higher frequencies. Fortunately, this distortion is negligible for a consonant. Figure 4.5 compares the original wide-band envelope and the envelopes estimated with and without this operation, each with its distribution of LSF values.

    Figure 4.5. Effect of artificial dispersion.

The last step of the envelope extension module is eliminating invalid values. By definition, LSF values must fall in the range $(0, \pi)$. The possible errors in this linear estimation are values over $\pi$. In this case, all the wide-band LSFs are scaled down proportionally, as shown in the following equation:

If $v(24) > 3.05$:  $\mathbf{v} \leftarrow 3.05\, \mathbf{v} / v(24)$    (4.11)

where $\mathbf{v}$ is the vector of wide-band LSF values. The reason for setting the upper limit to 3.05 instead of $\pi$ is that an LSF very close to $\pi$ may cause whistling noise. The occurrence of invalid values means estimation failure, and the purpose of this last operation is to smooth such errors out. In fact, when a good classification technique is used, such errors are very rare, and they are not perceivable due to the overlap between adjacent frames.
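Putting the steps of this branch together, a minimal sketch is given below (assuming NumPy, LSF vectors sorted in increasing order, and a 12x12 estimation matrix M trained for the frame's class by Equation 4.7; the function name is illustrative):

```python
import numpy as np

def extend_envelope(lsf_nb, M, is_fricative):
    """Envelope extension for one frame (Equations 4.6, 4.10, 4.11).
    lsf_nb: the 12 narrow-band LSFs in (0, pi); M: 12x12 estimation
    matrix for this frame's class; is_fricative: HMM-branch decision."""
    f_l = lsf_nb / 2.0                   # narrow band maps into (0, pi/2)
    if is_fricative and f_l[11] < 1.5:   # artificial dispersion (Eq. 4.10)
        f_l = 1.0825 * f_l
    f_h = f_l @ M                        # linear estimation (Eq. 4.6)
    v = np.concatenate((f_l, f_h))       # 24 wide-band LSF candidates
    if v[23] > 3.05:                     # eliminate invalid values (Eq. 4.11)
        v = 3.05 * v / v[23]
    return v                             # converted to order-24 LPC afterwards
```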

    4.2.3 Classification of Speech Frames

    From Sections 4.2.1 and 4.2.2, we have seen that classification decisions are used in training,

    estimation-matrix choice, residual adjustment, and the pre-processing of linear estimation.

    The accuracy of this classification of speech frames is critical to the performance of the

    whole algorithm.

    The proposed classification strategy in this thesis is based on a Hidden Markov

    Model (HMM) [26]. Speech frames are divided into four classes: class 1 includes vowels and

    semi-vowel consonants; class 2 includes nasal consonants; class 3 includes silence and weak

    consonants (stops); class 4 includes fricative consonants. Class 4 is of special interest in

    bandwidth extension. The definition of the four classes is given below in terms of the

    phonemes of each class (TIMIT phonetic symbols are used):

    Class 1={'aa' 'aw' 'ay' 'ah' 'ao' 'oy' 'ow' 'uh' 'l' 'r' 'w' 'el' 'iy' 'ih' 'eh' 'ey' 'ae' 'uw' 'ux'

    'er' 'ax' 'ix' 'axr' 'ax-h' 'b' 'dx' 'q' 'v' 'dh' 'y'}

    Class 2={'m' 'n' 'ng' 'em' 'en' 'eng' 'nx'}

    Class 3={'g' 'p' 'k' 'hh' 'hv' 'pau' 'epi' 'h#' 'tcl' 'kcl' 'bcl' 'gcl' 'pcl' 'dcl'}

    Class 4={'jh' 'ch' 'sh' 'zh' 'd' 't' 's' 'z' 'f' 'th'}

Since each speech frame belongs to one of the four classes, a sentence can be viewed as a sequence of states, with each state indicating the class of the current frame. This state sequence is hidden, and the purpose of our classification strategy is to recover it from information extracted from the speech signal.

    Three parameters are chosen as the basis of classification. The first parameter is k1,

    the first reflection coefficient of the speech frame. The second parameter is the zero-crossing

    rate of the speech frame, defined as follows:

$Z = \frac{1}{2N} \sum_{m=1}^{N} \left| \operatorname{sign}(x(m)) - \operatorname{sign}(x(m-1)) \right|$    (4.12)

where sign(·) denotes the sign function and $x(m)$ are the samples of the frame. The third parameter is the frame energy:

$E = \sum_{m=1}^{N} x^2(m)$    (4.13)

With the observation $O_t = \{k_1, E, Z\}$ extracted from frame $t$, the classification problem is to find the most likely state sequence given the observations:

$\{q_1, q_2, \ldots, q_T\} = \arg\max_{q_1, \ldots, q_T \in \{1,2,3,4\}} P[q_1, \ldots, q_T \mid O_1, \ldots, O_T]$    (4.14)

    where T is the number of frames in the sentence. Further considering

$P[q_1, \ldots, q_T \mid O_1, \ldots, O_T] = \frac{P[q_1, \ldots, q_T, O_1, \ldots, O_T]}{P[O_1, \ldots, O_T]}$    (4.15)

    where the denominator is a constant for a given sentence, an alternative expression of the

    classification problem is given by the following equation.

$\{q_1, q_2, \ldots, q_T\} = \arg\max_{q_1, \ldots, q_T \in \{1,2,3,4\}} P[q_1, \ldots, q_T, O_1, \ldots, O_T]$    (4.16)

In order to evaluate the probability in Equation 4.16, we need to build a statistical model describing the random processes $\{q_1, q_2, \ldots, q_T\}$ and $\{O_1, O_2, \ldots, O_T\}$. This HMM is specified by the following parameters [26]:

    1. The initial state distributions:

$\pi_i = P[q_1 = i], \qquad i = 1,2,3,4$    (4.17)

    2. The state transition probabilities:

$a_{ij} = P[q_{t+1} = j \mid q_t = i], \qquad i, j \in \{1,2,3,4\}$    (4.18)

    3. The observation distributions in each state (here we use a discrete distribution):

$b_i(O) = P[O_t = O \mid q_t = i], \qquad i = 1,2,3,4$    (4.19)

    These parameters are calculated from training sentences in the TIMIT database, and

    the phonetic description files in TIMIT are used to provide the real classification decisions in

    training. More details about training will be given later.

    After the model is built, i.e., after the parameters are calculated, the solution of

    Equation 4.16 can be computed by the Viterbi algorithm, which is computationally efficient.

First we define a quantity $\delta_t(i)$ as follows:

$\delta_t(i) = \max_{q_1, \ldots, q_{t-1} \in \{1,2,3,4\}} P[q_1, \ldots, q_{t-1}, q_t = i, O_1, O_2, \ldots, O_t], \qquad i = 1,2,3,4$    (4.20)

    The Viterbi procedure to find the best state sequence is given below [26]:

1. Initialization:

$\delta_1(i) = \pi_i\, b_i(O_1), \qquad i = 1,2,3,4$    (4.21)

$\psi_1(i) = 0, \qquad i = 1,2,3,4$    (4.22)

2. Recursion (for t = 2, 3, ..., T):

$\delta_t(j) = \max_{i=1,2,3,4} \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \qquad j = 1,2,3,4$    (4.23)

$\psi_t(j) = \arg\max_{i=1,2,3,4} \left[\delta_{t-1}(i)\, a_{ij}\right], \qquad j = 1,2,3,4$    (4.24)

3. Termination:

$q_T = \arg\max_{i=1,2,3,4} \left[\delta_T(i)\right]$    (4.25)

4. Backtracking:

$q_t = \psi_{t+1}(q_{t+1}), \qquad t = T-1, T-2, \ldots, 1$    (4.26)

In practice, all the multiplications above are implemented as additions in the log domain, because direct multiplication of a large number of small values can underflow the numerical resolution of any computer. The computational complexity of this procedure is on the order of 16T operations (T is the number of frames in the sentence), in addition to the extraction of {k1, Z, E} for each frame. This complexity is much higher than that of the fixed-threshold classification in Table 4.1, but is still feasible in real time. However, a real-time implementation cannot make use of the whole sentence. In practice, we have found that the performance of the Viterbi algorithm based on 30 frames is close to that based on a whole sentence, and therefore a real-time implementation can use 30 frames as the decision size. The delay caused by this algorithm is then 15 × 20ms = 0.3 seconds, which is still acceptable.
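A log-domain sketch of the procedure follows (assuming NumPy; the per-frame observation probabilities are precomputed, and the function name is illustrative):

```python
import numpy as np

def viterbi_log(log_pi, log_a, log_b):
    """Log-domain Viterbi decoding (Equations 4.21-4.26).
    log_pi: (4,) log initial probabilities; log_a: (4,4) log transition
    probabilities; log_b: (T,4) log observation probabilities per frame."""
    T, S = log_b.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_b[0]                      # initialization (4.21)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a        # delta_{t-1}(i) + log a_ij
        psi[t] = np.argmax(scores, axis=0)            # best predecessor (4.24)
        delta[t] = np.max(scores, axis=0) + log_b[t]  # recursion (4.23)
    q = np.empty(T, dtype=int)
    q[-1] = np.argmax(delta[-1])                      # termination (4.25)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]                   # backtracking (4.26)
    return q                                          # 0-based class indices
```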

The costs discussed in the previous paragraph are outweighed by the robustness the new classification strategy brings to the bandwidth-extension algorithm. This robustness shows when dealing with abnormal frames and abnormal speakers. A single frame with unusual parameters can still be assigned to the correct group, because the decision for that frame takes nearby frames into consideration. Classification techniques based on fixed thresholds also suffer when used on a speaker whose parameter distribution differs from the training data. The proposed classification technique, on the other hand, has soft thresholds and is more tolerant to such abnormal input.

The observation distribution in each state is a three-dimensional discrete distribution, defined as follows:

$b_i(O) = P[O_t = O \mid q_t = i] = P[\{k_1, E, Z\} \mid q_t = i] = P[\{x, y, z\} \mid q_t = i], \qquad i = 1,2,3,4$    (4.27)

where $\{x, y, z\}$ is the result of quantizing $\{k_1, E, Z\}$ into discrete bins (Equation 4.28).

The trained distributions are smoothed by averaging each bin with its immediate neighbors:

$b_i(x,y,z) \leftarrow \frac{b_i(x,y,z)}{2} + \frac{b_i(x-1,y,z) + b_i(x+1,y,z) + b_i(x,y-1,z) + b_i(x,y+1,z)}{12} + \frac{b_i(x,y,z-1) + b_i(x,y,z+1)}{12}$

$i = 1,2,3,4; \qquad x, y, z \in \{2, 3, \ldots, 34\}$    (4.29)
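A sketch of this smoothing, assuming each state's distribution is stored as a 35x35x35 NumPy histogram (0-based bins 0-34, so the updated interior bins correspond to x, y, z in {2, ..., 34}):

```python
import numpy as np

def smooth_distribution(b):
    """Neighbor smoothing of one state's observation histogram (Eq. 4.29).
    b: (35, 35, 35) array of bin probabilities; only interior bins change."""
    s = b.copy()
    c = b[1:-1, 1:-1, 1:-1]
    s[1:-1, 1:-1, 1:-1] = (
        c / 2.0
        + (b[:-2, 1:-1, 1:-1] + b[2:, 1:-1, 1:-1]
           + b[1:-1, :-2, 1:-1] + b[1:-1, 2:, 1:-1]) / 12.0
        + (b[1:-1, 1:-1, :-2] + b[1:-1, 1:-1, 2:]) / 12.0
    )
    return s
```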

    4.3 Evaluation and Results

    Subjective and objective measures are commonly used for evaluating speech-processing

    strategies. To evaluate this bandwidth extension algorithm, the ideal criterion is obviously a

    subjective test, in which hearing-impaired people listen to the narrow-band speech and

recovered wide-band speech, and make judgments about the quality. But in a subjective test,

    the result can be affected by many factors such as the degree of hearing loss, users'

    experience in using telephones, and the choice of sentences. The results of a subjective test

    often cannot be reproduced and therefore are not suitable for comparing different signal-

    processing algorithms.

    Objective measures, on the other hand, can be clearly defined and easily repeated.

    Two algorithms can be compared fairly by calculating an objective measure using the same

    speech database. However, the results of objective measures are often not highly correlated

with those of subjective measures. When a signal is said to have the lowest distortion according to an objective test, it may not have the highest intelligibility and naturalness, nor

    be preferred by human listeners. We still do not fully understand how human ears process

    speech signals and therefore cannot define the optimal objective measure to model speech

    quality.

    Among current objective measures, the Itakura-Saito (IS) spectral distance and Log

    Likelihood Ratio (LLR) are widely used, as they have relatively modest correlation with

speech intelligibility [25]. We use both of them to evaluate the proposed bandwidth-extension algorithm. Furthermore, a frequency-domain signal-to-noise ratio (SNR) is defined to evaluate the envelope-extension performance, and an accuracy measure is defined to evaluate the classification performance. The definitions and results of these four measures are provided in Section 4.3.2. Sentence examples are presented in Section 4.3.3.

    4.3.1 Test Material

    The TIMIT database is a standard speech database widely used by speech-processing

    researchers, and it is the source of all the training sentences and testing sentences in this

    thesis [18]. The original sentences are all wide-band speech with a 16kHz sampling rate. The

    TIMIT sentences were low-pass filtered and then down-sampled to generate the narrow-band

    TIMIT sentences (4kHz bandwidth). Each objective measure is calculated for male-speaker

    sentences, female-speaker sentences, and mixed sentences. In the male or female case, the

    training data is 250 sentences, and the testing data is 10 sentences; in the mixed case, the

    training data is 250 male sentences and 250 female sentences, and the testing data is 10 male

    sentences and 10 female sentences. There is no overlap between the training data and the

    testing data.

    4.3.2 Objective Measures

The four objective measures are discussed below, followed by the test results.

Itakura-Saito (IS) distance measure

IS and LLR are spectral distance measures based on the all-pole LPC model. They

    compare the original signal and the distorted signal on a frame-by-frame basis. The IS

    distance is the most widely used measure of spectral distortion, as it takes into

    consideration both the spectral envelope and the frame energy. The IS distance of one

    frame is computed as follows:

$d_{IS} = \frac{\mathbf{a}_y^T R_x \mathbf{a}_y}{\mathbf{a}_y^T R_y \mathbf{a}_y} + \log_{10}\left(\frac{\mathbf{a}_y^T R_y \mathbf{a}_y}{\mathbf{a}_x^T R_x \mathbf{a}_x}\right) - 1$    (4.30)

where $\mathbf{a}_x$ and $R_x$ are the linear prediction coefficient vector and the autocorrelation matrix of the original speech frame respectively; $\mathbf{a}_y$ and $R_y$ are the linear prediction coefficient vector and the autocorrelation matrix of the estimated speech frame to be evaluated.

    Since the IS measure changes when the energy of the target sentence changes, the

    original wide-band sentence and the estimated wide-band sentence are first normalized to

    have the same total energy. Then IS distance values are calculated for each pair of 20ms

    frames taken from the two signals. The results are averaged over all the frames, with the

    highest 10% of the IS distance values discarded to smooth out meaningless large

    numbers. The average IS distance is calculated as follows:

$\bar{d}_{IS} = \frac{1}{0.9N} \sum_{\text{lower } 90\%} d_{IS}$    (4.31)

    where N is the number of frames.
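A sketch of the per-frame computation and the trimmed averaging (assuming NumPy; the LPC vectors include the leading 1 and the R matrices are the Toeplitz autocorrelation matrices):

```python
import numpy as np

def is_distance(a_x, a_y, R_x, R_y):
    """Per-frame Itakura-Saito distance (Equation 4.30)."""
    ex = a_x @ R_x @ a_x          # optimal prediction-error energy, original
    ey = a_y @ R_y @ a_y          # prediction-error energy, estimate
    return (a_y @ R_x @ a_y) / ey + np.log10(ey / ex) - 1.0

def average_is(d_frames):
    """Average IS distance (Equation 4.31): drop the highest 10% of the
    per-frame values and average the lower 90%."""
    d = np.sort(np.asarray(d_frames))
    return d[: int(0.9 * len(d))].mean()
```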

    Figure 4.6 shows the performance comparison between the [6] algorithm and the

proposed algorithm. "A" denotes the [6] algorithm; "C" denotes the proposed algorithm in this thesis, with the HMM classification strategy; "B" denotes the proposed algorithm without the HMM part, using the adjusted classification thresholds given in Table 4.2. In all cases, the average IS distance is calculated over all testing sentences. The two versions of the proposed algorithm show consistently lower speech distortion values, which correspond to higher speech quality.

    Figure 4.6. IS distance comparison of the algorithms.

Speech class        Reflection coefficients
Class 1             k1 <= -0.7,  k2 <= 0.55
Class 2             k1 > -0.7,   k2 <= 0.55
Class 3             k1 <= -0.3,  k2 > 0.55
Class 4             k1 > -0.3,   k2 > 0.55

Table 4.2.--Adjusted thresholds used in "B".

Log Likelihood Ratio (LLR)

As the origin of the IS measure, LLR involves only the distortion of the LPC spectral envelope

    and has a much clearer physical significance. The LLR measure of one speech frame is

    computed as follows:

$d_{LLR} = \log_{10}\left(\frac{\mathbf{a}_y^T R_x \mathbf{a}_y}{\mathbf{a}_x^T R_x \mathbf{a}_x}\right)$    (4.32)

where $\mathbf{a}_x$ and $R_x$ are the linear prediction coefficient vector and the autocorrelation matrix of the original speech frame respectively; $\mathbf{a}_y$ is the linear prediction coefficient vector of the estimated speech frame to be evaluated. In the time domain, the denominator can be viewed as the optimal prediction-error energy, and the numerator as the prediction-error energy obtained using the estimated LPC coefficients. From Wiener filtering theory, the denominator is the minimum possible error energy, achieved only with the true LPC coefficients. Therefore, the numerator is always at least as large as the denominator, and the LLR value is always non-negative. The larger the LLR value, the more the estimated LPC coefficients differ from the real ones in the error-energy sense. In the frequency domain, the LLR can be

    reformulated as follows [25]:

$d_{LLR} = \log_{10}\left(1 + \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\frac{A_x(e^{j\omega}) - A_y(e^{j\omega})}{A_x(e^{j\omega})}\right|^2 d\omega\right)$    (4.33)

where $A_x(e^{j\omega})$ and $A_y(e^{j\omega})$ are the LPC spectra of the original speech frame and the generated speech frame respectively. Equation 4.33 can be viewed as a weighted sum of the spectral envelope distortion at all frequencies, with high weighting put on the formant

    frequencies of the original signal. Therefore, LLR mainly models the mismatch between

    the formants of the two speech frames.

    LLR values are calculated for each pair of 20ms frames taken from the two signals.

    The results are averaged over all the frames, with the highest 10% of the LLR values

    discarded to smooth out meaningless large numbers. The average LLR is calculated as

    follows:

$\bar{d}_{LLR} = \frac{1}{0.9N} \sum_{\text{lower } 90\%} d_{LLR}$    (4.34)

    where N is the number of frames.
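Given the same quantities as the IS measure, the per-frame computation is a one-liner; a sketch:

```python
import numpy as np

def llr(a_x, a_y, R_x):
    """Per-frame Log Likelihood Ratio (Equation 4.32): error energy of the
    estimated coefficients on the original frame over the optimal energy."""
    return np.log10((a_y @ R_x @ a_y) / (a_x @ R_x @ a_x))
```

The highest 10% of the per-frame values are then discarded exactly as for the IS measure.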

    Figure 4.7. LLR measure comparison of the algorithms.

  • 8/3/2019 Haifeng Ms Thesis

    64/78

    54

Figure 4.7 gives the bar plots of the average LLR values of the [6] algorithm and the proposed algorithm for the male, female, and mixed cases. The proposed algorithm shows consistently lower speech distortion.

Frequency-domain signal-to-noise ratio (SNR)

The frame-based segmental SNR is another popular method for evaluating speech

    quality. It is defined as the ratio of the signal energy to the noise energy in decibels.

    Because the phase information is lost when the bandwidth-extension algorithm processes

    the residual signal, the time-domain SNR is not suitable for measuring the performance.

    As an alternative, we propose the frequency-domain SNR (denoted as SNRf), which is

    defined in the following equation:

$SNR_f = 10\log_{10}\left(\frac{\int_{-\pi}^{\pi} \left|A_x(e^{j\omega})\right|^2 d\omega}{\int_{-\pi}^{\pi} \left|A_x(e^{j\omega}) - A_y(e^{j\omega})\right|^2 d\omega}\right)$    (4.35)

where $A_x(e^{j\omega})$ and $A_y(e^{j\omega})$ are the LPC spectra of the original speech frame and the estimated speech frame respectively. The frequency-domain SNR represents the distortion of the LPC spectral envelope. The difference between this measure and LLR is that with the SNRf measure the distortions at all frequency components are treated equally.

    SNRf values are calculated for each pair of 20ms frames taken from the two signals.

    The results are averaged over all the frames, and the average SNRf is calculated as

    follows:

$\overline{SNR_f} = \frac{1}{N} \sum_{n=1}^{N} SNR_f(n)$    (4.36)

    where N is the number of frames.
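A sketch, with the integrals of Equation 4.35 replaced by sums over a uniform frequency grid (env_x and env_y are sampled magnitudes of the two LPC spectra; these names are illustrative):

```python
import numpy as np

def snr_f(env_x, env_y):
    """Frequency-domain SNR of one frame (Equation 4.35)."""
    return 10.0 * np.log10(np.sum(env_x ** 2) /
                           np.sum((env_x - env_y) ** 2))
```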

    Figure 4.8. Frequency-domain SNR comparison of the algorithms.

    Figure 4.8 shows the bar chart comparing the frequency-domain SNR values of the

    estimated wide-band speech produced by the two algorithms. The proposed algorithm

shows a 2dB SNRf gain over the [6] algorithm. This is attributed to the improvement in the

    envelope extension branch of the algorithm.

Classification accuracy measure

As has been discussed in Section 4.2, the accuracy of the classification decisions is

    critical to the performance of the whole algorithm, and the accuracy of identifying

    fricative-consonant frames is particularly important. Therefore, when calculating this

    measure, we divide speech frames into only 2 groups: fricative-consonant frames and

    other frames. The accuracy is defined as t