

ORL Managing Editor W. Arnold, München

Separatum Publisher: S.Karger AG, Basel Printed in Switzerland

Norbert Dillier Hans Bögli Thomas Spillmann

Department of Otorhinolaryngology, University Hospital, and the Institute for Biomedical Engineering, University of Zürich and Swiss Federal Institute of Technology, Zürich, Switzerland

Key Words: Auditory prosthesis · Digital signal processing · Cochlear implants

Original Paper

ORL 1992;54:299-307

Digital Speech Processing for Cochlear Implants

Abstract
A rather general basic working hypothesis for cochlear implant research might be formulated as follows: signal processing for cochlear implants should carefully select a subset of the total information contained in the sound signal and transform these elements into those physical stimulation parameters which can generate distinctive perceptions for the listener. Several new digital processing strategies have thus been implemented on a laboratory cochlear implant speech processor for the Nucleus 22-electrode system. One of the approaches (PES, pitch excited sampler) is based on the maximum peak channel vocoder concept, whereby the spectral energy of a number of frequency bands is transformed into appropriate electrical stimulation parameters for up to 22 electrodes using a voice-pitch-synchronous pulse rate at any electrode. Another approach (CIS, continuous interleaved sampler) uses a maximally high, pitch-independent stimulation pulse rate on a selected number of electrodes. As only one electrode can be stimulated at any instant, the rate of stimulation is limited by the required stimulus pulse widths (as determined individually for each subject) and some additional constraints and parameters which have to be optimized and fine-tuned by psychophysical measurements.

Evaluation experiments with 5 cochlear implant users resulted in significantly improved performance in consonant identification tests with the new processing strategies as compared with the subjects' own wearable speech processors, whereas improvements in vowel identification tasks were rarely observed. The pitch-synchronous coding (PES) resulted in worse performance compared to the coding without explicit pitch extraction (CIS). A great portion of the improvement is probably due to better transmission of sibilance and frication (and, to a lesser extent, place of articulation) information.

Introduction

Research and development of optimized ways to restore auditory sensations and speech recognition for profoundly deaf subjects have concentrated in recent years very much on investigations of signal processing strategies. A number of technological and electrophysiological constraints imposed by the anatomical and physiological conditions of the human auditory system have to be considered. One basic working hypothesis for cochlear implants is the idea that the natural firing pattern of the auditory nerve should be approximated as closely as possible by electrical stimulation. The central processor (the human brain) would then be able to utilize natural ('pre-wired' as well as learned) analysis modes for auditory perception. An alternative hypothesis is the Morse code idea, which is based on the assumption that the central processor would be flexible enough to interpret any transmitted stimulus sequence after proper training and habituation.

Dr. N. Dillier, Department of Otorhinolaryngology, University Hospital, CH-8091 Zürich (Switzerland)

© 1992 S. Karger AG, Basel 0301-1569/92/0546-0299 $2.75/0

Fig. 1. Analog and digital signal processing for cochlear implants. In the example of analog signal processing, 4 band-pass filters are used to generate stimulation signals for 4 electrodes, similar to the scheme employed in the Symbion/Ineraid system. Digital signal processing details are hidden in the structure of the program and the selected algorithms and stimulus parameters. ADC = analog-digital converter; CPU = central processing unit; MEM = memory; MUX = multiplexer; CH = channel.

Both hypotheses have never really been tested, for obvious reasons. On the one hand, it is not possible to reproduce the activity of 30,000 individual nerve fibers with current electrode technology. In fact, it is even questionable whether it is possible to reproduce the detailed activity of a single auditory nerve fiber via artificial stimulation. There are a number of fundamental physiological differences in firing patterns of acoustically versus electrically excited neurons which are hard to overcome. Spread of excitation within the cochlea and current summation are other major problems of most electrode configurations. On the other hand, the coding and transmission of spoken language requires a much larger communication channel bandwidth and more sophisticated processing than a Morse code for written text. Practical experience with cochlear implants in the past indicates that some natural relationships (such as growth of loudness and voice pitch variations) should be maintained in the encoding process. One might therefore conceive a third, more realistic, hypothesis described as follows: signal processing for cochlear implants should carefully select a subset of the total information contained in the sound signal and transform these elements into those physical stimulation parameters which can generate distinctive perceptions for the listener.

Many researchers have designed and evaluated different systems, varying the number of electrodes and the amount of specific speech feature extraction and mapping transformations used [1]. Recently, Wilson et al. [2] reported astonishing improvements in speech test performance when they provided their subjects with high-rate pulsatile stimulation patterns rather than analog broadband signals. They attributed this effect partly to the decreased current summation obtained by nonsimultaneous stimulation of different electrodes (which might otherwise have stimulated partly the same nerve fibers and thus interacted in a nonlinear fashion) and partly to a fundamentally different, and maybe more natural, firing pattern due to an extremely high stimulation rate. Skinner et al. [3] also found significantly higher scores on word and sentence tests in quiet and noise with a new multipeak digital speech coding strategy as compared to the formerly used F0F1F2 strategy of the Nucleus WSP (wearable speech processor).

These results indicate the potential gains which may be obtained by optimizing signal processing schemes for existing implanted devices. With single-chip programmable digital signal processors (DSPs), it has become possible to evaluate different speech coding strategies in relatively short laboratory experiments with the same subjects. Figure 1 shows the basic differences between analog and digital signal processing. In addition to the well-known strategies realized with analog filters, amplifiers and logic circuits, a DSP approach allows the implementation of much more complex algorithms. Changes in DSP algorithms require only software or parameter changes, in contrast to the modifications of electronic hardware which are necessary with analog devices. Further miniaturization and low-power operation of these processors will be possible in the near future. The present study was conducted in order to explore new ideas and concepts of multichannel pulsatile speech encoding for users of the Clark/Nucleus cochlear prosthesis. Similar methods and tools can, however, be utilized to investigate alternative coding schemes for other implant systems equally well.

Table 1. Subjects

Patient identification              U.T.             T.H.      H.S.             S.A.             K.W.
Sex                                 Female           Male      Male             Female           Male
Date of birth, month/year           6/1941           2/1965    11/1944          7/1962           3/1947
Etiology                            Sudden deafness  Trauma    Sudden deafness  Sudden deafness  Meningitis
Duration, years                     15               3         14               1                28
Implantation date                   3/87             4/87      11/88            3/89             12/90
Side                                Left             Right     Right            Left             Left
Speech processor                    WSP              MSP       MSP              MSP              MSP
Strategy                            F0F1F2           F0F1F2    MPEAK            MPEAK            MPEAK
Electrodes                          16               20        19               20               18
Stimulus mode                       BP               BP+1      BP               BP               BP
T/C level (mean charge/phase), nC   73/137           79/157    38/84            37/62            74/130
Pulse width, µs                     150-204          204       100              100              204
Sentence test (4AFC), %             90               85        80               85               95
2-digit number test, %              55               95        85               40               80
Monosyllables test, %               5                20        15               20               10

MPEAK = Multipeak; 4AFC = four-alternative forced choice; BP = bipolar.

Subjects and Test Procedures

Evaluation experiments have been conducted with 5 postlingually deaf adults (age 26-50 years) who are cochlear implant users. As can be seen from table 1, all subjects were experienced users of their speech processors. The time since implantation ranged from 5 months (K.W.) to nearly 10 years (U.T.; single-channel extracochlear implantation in 1980, reimplanted after device failure in 1987), with good sentence identification (80-95% correct responses) and number recognition (40-95% correct responses) performance, minor open speech discrimination in monosyllabic word tests (5-20% correct responses; all tests presented via computer, hearing alone) and limited use of the telephone. One subject (U.T.) still used the old wearable speech processor (WSP), which extracts only the first and second formant and thus stimulates only two electrodes per pitch period. The other 4 subjects used the new miniature speech processor (MSP) with the so-called multipeak strategy, whereby in addition to first and second formant information three fixed electrodes may be stimulated to convey information contained in three higher frequency bands.

The same measurement procedure to determine thresholds of hearing (T levels) and comfortable listening levels (C levels) was used for the cochlear implant digital speech processor (CIDSP) strategies as was used for fitting the WSP or MSP. Figure 2a shows one example of measured T and C levels for 21 bipolar electrode pairs (subject S.A.). There can be considerable variation in these values from electrode to electrode, which may reflect different electrode-to-neuron distances or varying neural excitability. Amplitude and pulse width are inversely related, as shown in figure 2b. As most subjects used fixed amplitudes and varying pulse widths (so-called stimulus levels) with their MSPs, and the CIDSP algorithms required fixed pulse widths and varying amplitudes, all T and C levels were remeasured prior to the speech tests. Overall loudness of processed signals was adjusted by proportional factors (T and C modifiers) if necessary, following short listening sessions with ongoing speech and environmental sounds played from a tape recorder. Loudness growth functions were measured using an automated randomized psychophysical test procedure to determine appropriate amplitude mapping functions (fig. 2c). Only minimal exposure to the new processing strategies was possible due to time restrictions. After about 5-10 min of listening to ongoing speech, one or two blocks of a 20-item 2-digit number test with feedback of correct or wrong responses were done. There was no feedback given during the actual test trials. All test items were presented by a second computer which also recorded the subjects' responses, entered via touch-screen terminal (for multiple-choice tests) or keyboard (number tests and monosyllabic word tests). Speech signals were either presented via loudspeaker in a sound-treated room (when patients were tested with their wearable speech processors) or processed by the CIDSP in real time and fed directly to the transmitting coil at the subject's head. Different speakers were used for the ongoing speech, the number test and the actual speech tests, respectively.
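The amplitude-mapping step can be illustrated with a deliberately simplified sketch. The function name and the linear dB-to-current map below are illustrative assumptions; in the study, the mapping was derived from the measured loudness growth functions of each electrode.

```python
def map_to_amplitude(energy_db, t_level_ua, c_level_ua,
                     e_min_db=-40.0, e_max_db=0.0):
    """Map a band energy (dB) into an electrode's dynamic range [T, C].

    A linear dB-to-amplitude map is assumed here for illustration; the
    study used subject-specific loudness growth functions instead.
    """
    e = min(max(energy_db, e_min_db), e_max_db)      # clip to the input range
    frac = (e - e_min_db) / (e_max_db - e_min_db)    # 0 at T level, 1 at C level
    return t_level_ua + frac * (c_level_ua - t_level_ua)
```

Energies at or below the floor stimulate at the T level, full-scale energies at the C level, so stimulation never leaves the measured comfortable range.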

Signal Processing Strategies

A CIDSP for the Nucleus 22-channel cochlear prosthesis was designed using a single-chip digital signal processor (TMS320C25, Texas Instruments) [4]. For laboratory experiments, the CIDSP was incorporated in a general-purpose computer which provided interactive parameter control, graphical display of input/output buffers and off-line speech file processing facilities. The experiments described in this paper were all conducted using the laboratory version of the CIDSP.

Speech signals were processed as indicated in figure 3. After analog low-pass filtering (5 kHz) and analog-to-digital conversion (10 kHz), preemphasis and Hanning windowing (12.8 ms, shifted by 6.4 ms or less per analysis frame) were applied, and the power spectrum was calculated via fast Fourier transform; specified speech features, such as formants and voice pitch, were extracted and transformed according to the selected encoding strategy; finally, the stimulus parameters (electrode position, stimulation mode, pulse amplitude and duration) were generated and transmitted via inductive coupling to the implanted receiver. In addition to the generation of stimulus parameters for the cochlear implant, an acoustic signal based on a perceptive model of auditory nerve stimulation was output simultaneously.
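The analysis front end just described (10 kHz sampling, preemphasis, 12.8 ms Hanning window shifted by 6.4 ms, FFT power spectrum) can be sketched as follows. The preemphasis coefficient of 0.95 is an assumed value, not taken from the paper.

```python
import numpy as np

FS = 10_000    # sampling rate (Hz) after the 5 kHz anti-aliasing low-pass filter
N_WIN = 128    # 12.8 ms Hanning window at 10 kHz
N_HOP = 64     # 6.4 ms frame shift

def power_spectra(signal, preemph=0.95):
    """Return one power spectrum (65 points, 0-5 kHz) per analysis frame."""
    x = np.asarray(signal, dtype=float)
    x = np.append(x[0], x[1:] - preemph * x[:-1])   # first-order preemphasis
    win = np.hanning(N_WIN)
    frames = [np.abs(np.fft.rfft(x[s:s + N_WIN] * win)) ** 2
              for s in range(0, len(x) - N_WIN + 1, N_HOP)]
    return np.array(frames)
```

With 64 FFT bins above DC, each bin spans 78.125 Hz, which matches the 64 frequency points per spectrum mentioned in the caption of figure 4.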

Fig. 2. Psychophysical measurement data for subject S.A. a Thresholds of hearing (T level) and comfortable listening levels (C level), with the resulting dynamic range, for 21 bipolar electrode pairs (100 µs/phase, BP). b Combinations of pulse width versus current amplitude for T and C levels measured with 4 different bipolar electrode pairs. c Loudness growth functions measured for 8 different bipolar electrode (EL.) pairs. The stimulation mode was bipolar (BP) for all measurements.

Fig. 3. Digital signal processing steps. [Block diagram: digital speech signal → power spectrum → feature extraction (6 spectral peaks for PES; all bands above NCL for CIS-NA) → encoding for transmission → transmitter → receiver → implanted electrodes; an acoustic model output feeds a loudspeaker in parallel.]

Fig. 4. Schematic display of the PES and CIS-NA coding strategies. The power spectrum (64 frequency points) is divided into 22 frequency bands. Six spectral peaks are selected and mapped to 6 electrodes for the PES strategy; all energy values above a preset noise cut level (NCL) are mapped to corresponding electrodes for the CIS-NA strategy. Tp = pitch period; Ts = stimulus rate period.

Fig. 5. Summary of vowel and consonant identification test results. Average percentages of correct responses for 8 vowels and 12 consonants with four different processing strategies (WSP/MSP, PES, CIS-NA, CIS-WF). Scores were corrected for chance level as follows: S = (R - CL)/(100 - CL), where R = raw score (%) and CL = chance level (%).

Two main processing strategies were implemented on this system. The first approach (PES, pitch excited sampler) is based on the maximum peak channel vocoder concept, whereby the time-averaged spectral energies of a number of frequency bands (approximately third-octave bands) are transformed into appropriate electrical stimulation parameters for up to 22 electrodes (fig. 4, left). The pulse rate at any electrode is controlled by the voice pitch of the input speech signal. A pitch extractor algorithm calculates the autocorrelation function of a low-pass-filtered segment of the speech signal and searches for a peak within a specified time lag interval. A random pulse rate of about 150-250 Hz is used for unvoiced speech portions.
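The pitch extractor described above can be sketched as below. The lag search range (corresponding to 80-400 Hz) and the voicing threshold are illustrative assumptions, not values given in the paper.

```python
import numpy as np

def estimate_pitch(frame, fs=10_000, f_min=80.0, f_max=400.0, v_thresh=0.3):
    """Autocorrelation pitch estimate: peak search within a lag interval.

    Returns the pitch in Hz, or None for frames classified as unvoiced
    (the caller would then fall back to a random 150-250 Hz pulse rate).
    """
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    if ac[0] <= 0.0:
        return None                       # silent frame
    ac = ac / ac[0]                       # normalize so that r(0) = 1
    lo, hi = int(fs / f_max), int(fs / f_min)
    lag = lo + int(np.argmax(ac[lo:hi]))  # strongest peak in the lag window
    return fs / lag if ac[lag] >= v_thresh else None
```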

The second approach (CIS, continuous interleaved sampler) uses a stimulation pulse rate which is independent of the fundamental frequency of the input signal. The algorithm continuously scans all frequency bands and samples their energy levels (fig. 4, right). As only one electrode can be stimulated at any instant, the rate of stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and the time to transmit additional stimulus parameters. As the information about the electrode number, the stimulation mode, the pulse amplitude and the pulse width is encoded by high-frequency bursts (2.5 MHz) of different durations, the total transmission time for a specific stimulus depends on all of these parameters. This transmission time can be minimized by choosing the shortest possible pulse width combined with the maximal amplitude. For very short pulse durations, the overhead imposed by the transmission of the fixed stimulus parameters can become rather large. Consider for example the stimulation of electrode pair (21, 22) at 50 µs. The maximally achievable rate varies from about 3,600 Hz for high amplitudes to about 2,700 Hz for low amplitudes, whereas the theoretical limit would be close to 10,000 Hz (biphasic pulses with minimal interpulse interval). In cases with higher pulse width requirements (which may be due to poor nerve survival, unfavorable electrode position or other unknown factors), the relative overhead becomes smaller.
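The rate limit can be made concrete with a back-of-the-envelope bound. The per-stimulus transmission overhead is treated here as a free parameter, since the exact burst encoding times are not given in the text:

```python
def max_stim_rate_hz(pulse_width_us, overhead_us=0.0):
    """Upper bound on the stimulation rate for biphasic pulses.

    Each stimulus occupies two phases of pulse_width_us plus whatever
    time the parameter transmission adds (overhead_us, a free parameter).
    """
    return 1e6 / (2.0 * pulse_width_us + overhead_us)
```

With 50 µs phases and zero overhead this gives the 10,000 Hz theoretical limit quoted above; an assumed overhead of roughly 180-270 µs reproduces the observed 2,700-3,600 Hz range.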

In order to achieve maximally high stimulation rates for those portions of the speech input signals which are assumed to be most important for intelligibility, several modifications of the basic CIS strategy were designed, of which only the two most promising will be considered in the following. The analysis of the short-time spectra was performed either for a large number of narrow frequency bands (corresponding directly to the number of available electrodes) or for a small number (typically 6) of wide frequency bands, analogous to the approach suggested by Wilson et al. [2]. The frequency bands were logarithmically spaced from 200 to 5,000 Hz in both cases. Spectral energy within any of these frequency bands was mapped to stimulus amplitude at a selected electrode as follows: all narrow-band analysis channels whose values exceeded a noise cut level were used for CIS-NA, whereas all wide-band analysis channels, irrespective of the noise cut level, were mapped to preselected fixed electrodes for CIS-WF. Both schemes are supposed to minimize electrode interactions by preserving maximal spatial distances between subsequently stimulated electrodes. In both the PES and the CIS strategies, a high-frequency preemphasis was applied whenever a spectral gravity measure exceeded a preset threshold.
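The two channel-selection rules can be sketched as follows. The function names and the dB convention are illustrative, but the logarithmic 200-5,000 Hz spacing and the noise-cut rule follow the description above.

```python
import numpy as np

def log_band_edges(n_bands, f_lo=200.0, f_hi=5000.0):
    """Logarithmically spaced band edges, as used for both CIS variants."""
    return np.geomspace(f_lo, f_hi, n_bands + 1)

def select_channels_cis_na(band_energies_db, noise_cut_db):
    """CIS-NA: every narrow band above the noise cut level drives its electrode."""
    return [i for i, e in enumerate(band_energies_db) if e > noise_cut_db]

def select_channels_cis_wf(band_energies_db):
    """CIS-WF: all (typically 6) wide bands map to fixed electrodes, regardless of level."""
    return list(range(len(band_energies_db)))
```

Because only bands with significant energy are stimulated in CIS-NA, consecutive active electrodes tend to be spatially separated, which is the intended interaction-minimizing effect.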

Results

Systematic investigations of these processing strategies with 5 subjects were performed and are still ongoing. Figure 5 summarizes the results for consonant and vowel identification tests. The average scores for consonant tests with the subjects' own wearable speech processors were significantly lower than with the new CIDSP strategies. The pitch-synchronous coding (PES) resulted in worse performance compared to the coding without explicit pitch extraction (CIS-NA and CIS-WF). Vowel identification scores, on the other hand, did not improve with modifications of the signal processing strategy.

A more detailed analysis of the consonant tests is shown in figure 6 for all subjects. The results (12 consonants in /aCa/ context, at least 144 trials per condition) are presented as percentages of information transmitted according to the method described by Miller and Nicely [5] for the phonological features voicing, sonorance, sibilance, frication and place of articulation, as listed in table 2.

Fig. 6. Information transmission analysis for 12-consonant identification test confusion matrices with four different signal processing strategies. a Subject U.T. b Subject T.H. c Subject H.S. d Subject S.A. e Subject K.W. VOI = voicing; NAS = nasality; SON = sonorance; SIB = sibilance; FRI = frication; PLC = place of articulation. The four bars per feature correspond to MSP (WSP in a), PES, CIS-NA and CIS-WF.

Table 2. Consonant phoneme features

Feature     p  t  k  b  d  g  m  n  l  r  f  s
Voicing     -  -  -  +  +  +  +  +  +  +  -  -
Nasality    -  -  -  -  -  -  +  +  -  -  -  -
Sonorance   -  -  -  -  -  -  +  +  +  +  -  -
Sibilance   -  -  -  -  -  -  -  -  -  -  -  +
Frication   -  -  -  -  -  -  -  -  -  -  +  +
Place       1  2  3  1  2  3  1  2  2  3  2  2

The analysis of the confusion matrices revealed a rather complex pattern across subjects, conditions and speech features. Overall information was best transmitted with the CIS-NA strategy (except for subject T.H., who scored slightly higher with CIS-WF). Improvements of 40% (U.T.) and 20% (T.H. and H.S.) could be observed relative to the subjects' own wearable speech processors. It can be seen in figure 6 that these subjects performed significantly better with at least some of the new CIDSP strategies than with their own wearable speech processors. No significant improvement with either PES or CIS was noted for subject K.W. The best transmitted speech feature for most subjects and strategies was sonorance. The largest improvements with CIS for U.T. and T.H. were achieved for sibilance and frication, whereas the other subjects showed either moderate improvement or even worse performance for these high-frequency features with CIS compared to their own wearable MSP. Considerable improvements in transmitted voicing information were observed for all subjects except K.W. with CIS, although this processing mode does not explicitly encode this feature. The improvements for place of articulation transmission, finally (U.T., T.H., S.A.), may indicate that increased stimulation rates are indeed more effective in signalling the formant transitions which distinguish phonemes articulated at different vocal tract positions.
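The feature-wise scores in figure 6 are based on Miller and Nicely's relative transmitted information, which for a stimulus-response confusion matrix can be computed as:

```python
import numpy as np

def transmitted_information(confusions):
    """Relative transmitted information T(x;y)/H(x) for a confusion matrix.

    Rows are presented stimuli, columns are responses (counts), following
    Miller & Nicely (1955). Returns a value between 0 and 1.
    """
    p = np.asarray(confusions, dtype=float)
    p = p / p.sum()                                  # joint probabilities
    px, py = p.sum(axis=1), p.sum(axis=0)            # marginals
    t = sum(p[i, j] * np.log2(p[i, j] / (px[i] * py[j]))
            for i in range(p.shape[0])
            for j in range(p.shape[1]) if p[i, j] > 0)
    h_x = -sum(q * np.log2(q) for q in px if q > 0)  # stimulus entropy
    return t / h_x
```

A perfectly identified feature yields 1.0, pure guessing yields 0.0; per-feature matrices are obtained by collapsing the 12-consonant confusion matrix along each feature's categories.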

Discussion and Conclusions

The above speech test results should be regarded as preliminary. The number of subjects is still very small, and data collection has not yet been completed for all of them in every processing condition.

It is, however, very promising at this point that new signal processing strategies can improve speech discrimination considerably. Consonant identification apparently may be enhanced by more detailed temporal information and specific speech feature transformations. Whether these improvements will pertain in the presence of interfering noise also remains to be verified. Further optimization of these processing strategies should preferably be based on more specific data about loudness growth functions for individual electrodes or additional psychophysical measurements.

Although many aspects of speech encoding can be efficiently studied using a laboratory DSP, it would be desirable to allow subjects more time to adjust to a new coding strategy. Several days or weeks of habituation are sometimes required until a new mapping can be fully exploited. Thus, for scientific as well as practical purposes, the miniaturization of wearable DSPs will be of great importance.

Acknowledgements

This work was supported by the Swiss National Research Foundation (grants No. 4018-10864 and 4018-10865). Implant surgery was performed by Prof. U. Fisch. Valuable help was also provided by Dr. E. von Wallenberg of Cochlear AG, Basel, Switzerland.

References

1 Clark GM, Tong YT, Patrick JF: Cochlear Prostheses. Edinburgh, Churchill Livingstone, 1990.

2 Wilson BS, Lawson DT, Finley CC, Wolford RD: Coding strategies for multichannel cochlear prostheses. Am J Otol 1991;12(suppl 1):55-60.

3 Skinner MW, Holden LK, Holden TA, Dowell RC, et al: Performance of postlingually deaf adults with the wearable speech processor (WSP III) and mini speech processor (MSP) of the Nucleus multi-electrode cochlear implant. Ear Hear 1991;12:3-22.

4 Dillier N, Senn C, Schlatter T, Stöckli M, Utzinger U: Wearable digital speech processor for cochlear implants using a TMS320C25. Acta Otolaryngol Suppl (Stockh) 1990;469:120-127.

5 Miller GA, Nicely PE: An analysis of perceptual confusions among some English consonants. J Acoust Soc Am 1955;27:338-352.
