
[IEEE 2007 IEEE International Workshop on Haptic, Audio and Visual Environments and Games - Ottawa, ON, Canada (2007.10.12-2007.10.14)] 2007 IEEE International Workshop on Haptic,

HAVE 2007 - IEEE International Workshop on Haptic Audio Visual Environments and their Applications
Ottawa - Canada, 12-14 October 2007
978-1-4244-1571-7/07/$25.00 ©2007 IEEE

Audio-Haptic Feedback in Speech Processing

Zygmunt Ciota
Department of Microelectronics and Computer Science, Technical University of Lodz,
Al. Politechniki 11, 90-924 Lodz, Poland
E-mail: [email protected]

Abstract - The main goal of the paper is to achieve better human communication and interaction during conversation by supervising emotional states and improving voice quality. The proposed approach should therefore also be very helpful in cases of vocal tract illness, for monitoring the treatment process. Since haptic feedback can nowadays operate through different senses of touch, such as kinesthetic, tactile, cutaneous or force feedback, such sensations can also be helpful directly in medical treatment as a supplementary method.

Keywords - Audio-haptic interaction, Speech processing, Glottis excitation, Emotion recognition

I. INTRODUCTION

The paper presents an attempt at the analysis of speech features and its possible application in haptic-audio interfaces oriented towards medical applications. Mixed time- and frequency-domain analysis of the voice signal gives a large number of feature vectors. Currently, such vectors are generally used in speech synthesis, speech recognition or speaker identification. It seems, however, that some of the features can also be used in haptic interfaces in a multimodal environment [4-8]. Therefore, the possibilities of modern speech processing should be discussed first. One of the most important tasks is a proper definition of the feature parameters. Since impediments of speech usually depend strongly on the emotional state of the speaker, proper control of emotions plays a critical role in the treatment process. The speaking fundamental frequencies and the time-energy distribution parameters make it possible to create an emotion recognition system. Different realizations of the decision algorithm for emotion classification are also discussed. Research concerning psychological and neurobiological analysis is also important and can improve the efficiency of an intelligent human-machine interface. Unfortunately, it is impossible to find sharp boundaries between different emotions. Moreover, because many different groups of scientists are involved in this topic, it is also impossible to obtain final agreement on the number of emotions and a homogeneous definition of them.

As a background, resulting from common human behavior, we can take into account: joy, anger, sadness, disgust, fear, surprise and the neutral state. Of course, for a better understanding of mutual human relations it would be necessary to expand the list significantly, but for engineering purposes this background seems to be sufficient. Moreover, if the goal is correct emotion recognition, the recognition efficiency is inversely proportional to the number of emotions. The simplest analysis of only two emotions, negative and non-negative, can sometimes be very useful, especially in medical applications.

Fig. 1. Spectrogram (horizontal axis: Time [sec.])

The calculation of the speech features is rather complicated and is generally based on time-frequency diagrams presented as spectrograms. An example of a spectrogram, calculated using a Hamming window and the fast Fourier transform, is shown in Fig. 1.

II. EMOTION RECOGNITION

The proper choice of speech features has a very significant influence on the efficiency of emotion recognition. The most important and useful features can be gathered into four main groups: the long-term spectrum (LTS), the speaking fundamental frequencies (SFF), the time-energy distribution (TED) and vowel formant tracking (VFT) [1, 2, 3]. Different emotions can be presented in a multidimensional space as a function of the above features. An example of such a function is shown in Fig. 2, which presents a two-dimensional space described by two axes: "Energy" and "Quality". If the number of space dimensions goes up, the situation becomes more and more complicated, and different emotion spaces can overlap. In such a case increasing the number of features can even decrease recognition efficiency.

In our method we applied the features describing the speaking fundamental frequency F0 and the time-energy distribution of the voice. Roughly speaking, TED is responsible for the energy and F0 for the quality of the speech. The efficiency of F0 depends on the processing system, which has to track and extract the fundamental frequencies precisely. As a result, the statistical behavior of these frequencies also has to be included; fortunately, these frequencies are very sensitive to different emotions of the same speaker, such as anger, joy, sadness or boredom.
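The F0 tracking and F0 statistics discussed above can be illustrated with a short sketch. It assumes NumPy and one common tracking technique, peak picking on the real cepstrum of a Hamming-windowed frame; it is not the author's implementation. The function names, the frame length and the 90-250 Hz search band (chosen here only to cover the F0 values reported in the paper's tables) are assumptions.

```python
import numpy as np


def cepstral_f0(frame, fs, f_min=90.0, f_max=250.0):
    """Estimate F0 (Hz) of one voiced frame from the peak of its real cepstrum.

    f_min and f_max bound the search band; both are assumed values.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    # The logarithm turns the product (excitation x vocal tract) into a sum.
    log_magnitude = np.log(magnitude + 1e-12)
    cepstrum = np.fft.irfft(log_magnitude)
    # Search only quefrencies that correspond to plausible pitch periods.
    q_lo = int(fs / f_max)
    q_hi = int(fs / f_min)
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi]))
    return fs / peak


def f0_statistics(f0_track):
    """Return F0-mean, F0-max, F0-min and F0-range for a sequence of F0 values."""
    f0 = np.asarray(f0_track, dtype=float)
    return {
        "mean": float(f0.mean()),
        "max": float(f0.max()),
        "min": float(f0.min()),
        "range": float(f0.max() - f0.min()),
    }


if __name__ == "__main__":
    fs = 16000                      # assumed sampling rate
    frame = np.zeros(1024)          # assumed frame length (64 ms)
    frame[::100] = 1.0              # synthetic pulse train, period 100 samples
    print(cepstral_f0(frame, fs))
    print(f0_statistics([120.0, 172.0, 228.0]))
```

In practice the per-frame estimates would be computed only on voiced frames and then passed to f0_statistics; voicing detection is left out of this sketch.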

Fig. 2. Emotions in two-dimensional space described by energy and quality of the speech (axes: Energy, Quality; labels: Anger, excited, Joy, afraid, Sadness, Neutral, Bored, relax)

The input signal corresponding to glottal waves can be simulated using different models. An example of a useful model based on electro-mechanical equivalents of the human glottis is shown in Fig. 3. Such a model can be used as the input of the whole vocal tract, including models of the nasal and mouth tracts, according to Fig. 4.

For fundamental frequency calculation, two basic methods are available: the autocorrelation method and the cepstrum method. The first gives precise results, but we discovered that additional incorrect glottis frequencies are created. Additionally, the program indicates some glottis excitations during breaks between phones and in silent regions. Therefore, in this method it is necessary to apply special filters to eliminate all incorrect frequencies.

Fig. 3. Modeling of glottis behavior (labels: Ps, Vg(t), s1, r1, m2, s2, r2, Glottis channel, Voice)

The other method is based on cepstrum analysis. In this transform the convolution of the glottis excitation and the vocal tract response is first converted to a product by the Fourier transform and then, after taking the logarithm, separated as a sum. In our method we use cepstrum analysis because it is less complex, especially when the modulus of the cepstrum is computed from the modulus of the Fourier transform. The following values of the glottis frequency F0 have been taken into account: the minimum and maximum values F0-min and F0-max, the mean value F0-mean, and the range F0-range. The parameters describing the time-energy distribution have been calculated using the fast Fourier transform, dividing speech utterances into 20 ms slots. As a result we obtained eight values of the energy for the following frequency ranges: 0-400 Hz, 400-800 Hz, 800-1500 Hz, 1500-2000 Hz, 2000-3500 Hz, 3500-5000 Hz and 5000-8000 Hz.

Fig. 4. Modeling of the whole vocal tract (blocks: noise generation, glottis, variable transmittance, radiation impedance of mouth, final signal of speech)

The process of emotion recognition consists of two main parts: emotion teaching and the recognition proper. During the teaching process a base of parameters is created for all emotions. The comparison of the current voice with the base gives the answer concerning the emotional state of the examined utterance. The comparison process and the final decision are based on two classifiers: nearest mean (NM) and nearest neighbour (NN). The decision process can be optimized using different distances and parameter weights. This part of the method is very important and still open. In particular, in the presence of low-quality teaching material it would be necessary to apply probabilistic methods and multilayer neural perceptrons [6].

III. FUNDAMENTAL FREQUENCY FEATURES

In medical applications haptic feedback should be helpful in diagnosis and treatment. Malfunction of pronunciation depends strongly on emotional states such as anger, fear, sadness and boredom, because of glottis signal deterioration. In other words, the fundamental frequency F0 becomes unstable. Information about such changes can help doctors observe the progress of glottis treatment. Moreover, such acoustic feedback should be helpful for the patient himself, increasing his effort to improve the pronunciation of the speech. The threshold at which the frequency properties of the glottis signal become worse can be obtained during the learning process using a database of the patient's utterances. From the technical point of view the haptic feedback can be realized using a ring attached to a hand and containing a vibrator. The frequency of vibration should depend on the intensity of negative emotions.

Table 1. F0 parameters for male voice
           F0-mean   F0-max   F0-min   F0-range
Anger      172 Hz    228 Hz   120 Hz   108 Hz
Neutral    106 Hz    142 Hz    94 Hz    48 Hz

Table 2. F0 parameters for female voice
           F0-mean   F0-max   F0-min   F0-range
Anger      201 Hz    232 Hz   106 Hz   126 Hz
Joy        209 Hz    240 Hz   165 Hz    76 Hz

The system requires several time-to-frequency transforms, but it is necessary to assure real-time operation. Therefore, the implementation of the proposed algorithms using a hardware-software system, including a mixed analog-digital approach, should improve the speed and the quality of proper action. Applying a mixed digital-analog realization to the design of sound processors may be better than a purely digital solution, and very often we can achieve better results, decreasing the chip surface and increasing the speed of the system.

... circuits is inherently a complex task involving human expertise as well as aids intended to accelerate the process.
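The nearest mean (NM) and nearest neighbour (NN) decisions described in Section II can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's classifier: the class prototypes are the two rows of Table 1 (male voice), the distance is a weighted Euclidean distance, and all function names are hypothetical. With a single vector per emotion, as here, NM and NN give the same answer; they differ once the teaching base holds several utterances per emotion.

```python
import numpy as np

# F0-mean, F0-max, F0-min, F0-range in Hz, copied from Table 1 (male voice).
TABLE_1_MALE = {
    "anger": [172.0, 228.0, 120.0, 108.0],
    "neutral": [106.0, 142.0, 94.0, 48.0],
}


def weighted_distance(a, b, weights=None):
    """Weighted Euclidean distance; equal weights are assumed by default."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))


def nearest_mean(x, training, weights=None):
    """Pick the emotion whose mean feature vector is closest to x.

    training maps each emotion label to a list of feature vectors.
    """
    means = {
        label: np.mean(np.asarray(vectors, dtype=float), axis=0)
        for label, vectors in training.items()
    }
    return min(means, key=lambda label: weighted_distance(x, means[label], weights))


def nearest_neighbour(x, training, weights=None):
    """Pick the emotion of the single closest training vector."""
    best_label, best_dist = None, float("inf")
    for label, vectors in training.items():
        for v in vectors:
            d = weighted_distance(x, v, weights)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


if __name__ == "__main__":
    # One prototype per emotion, taken from Table 1.
    base = {label: [row] for label, row in TABLE_1_MALE.items()}
    utterance = [165.0, 220.0, 118.0, 102.0]  # hypothetical measurement
    print(nearest_mean(utterance, base), nearest_neighbour(utterance, base))
```

The weights argument corresponds to the parameter weights mentioned in the text: setting a weight to zero removes that F0 parameter from the decision.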

The extracted fundamental frequency parameters for male and female voices are shown in Table 1 and Table 2, respectively. It is easy to observe the high sensitivity of these parameters to the emotional state of the speaker. The range of F0 is equal to 126 Hz for the angry speech of a woman, while for joyful speech the corresponding range decreases to 76 Hz. Similarly, high sensitivity of the fundamental frequency parameters can be observed for the angry and neutral speech of a man (see Table 1).

IV. MIXED HARDWARE-SOFTWARE APPROACH

The design process of integrated circuits is focused on two main directions: to scale down the dimensions of the transistors, and to incorporate as many building blocks as possible in a single chip. This means that modern CMOS technology requires a low-voltage power supply.

VLSI techniques usually demand high-performance A/D and D/A converters, because digital circuits have to be interconnected to the real and in most cases analog world. As a consequence, almost every fully integrated system contains some specific mixed blocks. In many applications the analog-to-digital converter remains the most important component of the analog circuits. The design process of high-performance converters is complicated and time consuming. The performance also depends on the process technology, therefore it is necessary to find a compromise between the precision, the power consumption and the cost of the chip.

The design process based on a current-mode approach can be applied in standard CMOS technology. Analog integrated circuits can be divided into continuous-time and sampled-data techniques. In the sampled-data group switched-current circuits are the most important, because this technique requires only a standard digital CMOS process. Such circuits use MOS transistors as storage elements to provide analog memory capability. Using current mirror principles, a voltage is sampled onto the gate of a MOS transistor and held on its gate capacitance. An example of the family of frequency characteristics for a switched-current filter, made as an integrated CMOS prototype, is presented in Fig. 5.

Fig. 5. Amplitude characteristics of switched current filter for different clock frequencies

Furthermore, such an efficient system has to have real-time capabilities, so hardware-software co-design makes it possible to achieve low cost and high-speed performance. While microcontrollers and microprocessors are inherently digital components, some functions can be executed in analog or digital form. An example of a simple speech recognition system based on an ARM processor is shown in Fig. 6.

Fig. 6. An example of speech processing prototype using ARM processor

The general trend is towards digital solutions, which guarantee high density and easy design. Available digital design tools, for example the Cadence environment, can automatically convert the logic scheme to the chip layout. We expect that hardware realization of some components can be useful for a smart real-time recognition system.
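The dependence of a sampled-data filter's characteristic on the clock frequency, as plotted for the switched-current filter in Fig. 5, can be illustrated with a generic first-order example. This is only a sketch under stated assumptions: the order and coefficients of the prototype filter are not given in the paper, so a first-order recursive low-pass with an assumed coefficient a = 0.9 and assumed clock frequencies is used instead. Because the response depends only on f / f_clk, the cutoff frequency scales linearly with the clock.

```python
import numpy as np


def lowpass_gain(f, f_clk, a=0.9):
    """Magnitude response of y[n] = a*y[n-1] + (1-a)*x[n] clocked at f_clk (Hz).

    The coefficient a is an assumed value, not taken from the prototype.
    """
    z_inv = np.exp(-2j * np.pi * np.asarray(f, dtype=float) / f_clk)
    return np.abs((1.0 - a) / (1.0 - a * z_inv))


def cutoff_frequency(f_clk, a=0.9):
    """-3 dB frequency (Hz) of the same filter, obtained from |H|^2 = 1/2."""
    cos_w = (4.0 * a - 1.0 - a * a) / (2.0 * a)
    return f_clk * float(np.arccos(cos_w)) / (2.0 * np.pi)


if __name__ == "__main__":
    for f_clk in (50e3, 100e3, 200e3):  # assumed clock frequencies
        print(f_clk, cutoff_frequency(f_clk))
```

Doubling the clock frequency doubles the cutoff frequency, which is why a single switched-current cell can be retuned by changing its clock.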

Current-mode realizations of analog components permit a significant decrease in chip area. Among the most important analog realizations we can mention filters and analog-to-digital converters. Comparison of the different design methods makes it possible to choose the best one with regard to the
silicon die (see Fig. 7). Bulk CMOS technology, with all its advantages (economic) and disadvantages (quality), was used. The future belongs to SOI CMOS technology, especially in the area of System-on-Chip integration. Better substrate noise isolation, lower threshold voltages of MOS transistors and lower power dissipation are key aspects in this case.

Taking into account the above remarks, an audio processing system has been designed and fabricated in CMOS technology [9]. The prototypes contain high-precision CMOS integrated filters with low sensitivity to technological mismatches. The proposed laboratory stand makes it possible to obtain accurate current measurements for such circuits. The prototype also contains digital memory to control all functions of audio signal processing, including software radio capabilities [9].

The digital part has driving purposes: it consists of the configuration system for the analogue part and also the software radio signal processing. A novel method of FM signal demodulation was introduced, based on an asynchronous digital circuit using standard CMOS digital cells. Direct pulse counting was used for FM detection, and the obtained samples were processed by a fixed-point arithmetic circuit. A digital sigma-delta modulator was used as an additional output interface. The analogue part has been designed as switched-current reprogrammable cells. The coefficients for this part are loaded from the digital part.

The presented results show new possibilities at the system integration level. Combining analogue SI circuits and digital signal processing on the same chip reduces overall costs and gives a more robust and flexible system. The System-on-Chip allows a complete structure to be built for up-to-date multimedia purposes with low power dissipation and modest costs in high-volume production series, which is very important for many modern portable devices.

The reprogrammable SI cells presented in this paper could be treated as a first step in the exploration of this very promising technique, often proclaimed as the antecessor of widely used SC discrete-time processing. The lack of industrial proposals is a gap that could potentially be filled by reprogrammable SI circuits. Wider usage of SI could reveal and improve its negative features, which might lead to a better solution for discrete-time signal processing.

Current work is focused on the possibilities of integrating a haptic interface, together with a speech processing system, in a single chip.

V. CONCLUSIONS

Detailed analysis of the recognition efficiency leads us to the following conclusions. Using mixed analogue-digital methods we can obtain better acoustic signal processing results, such as compression, identification and recognition. Considering the more general problem, the improvement offered by the method depends on the voice quality and the environmental conditions.

Proper and fast extraction of speech parameters is a difficult task, because the quality of speech processing depends strongly on the environment. It is also necessary to take into account that different words have different recognition difficulties, therefore the precision of extraction depends on the spoken text. We expect that hardware realization of some components, e.g. predictive filters or neural network units, can be useful for smart real-time predictors.

The implementation of the proposed algorithms as a hardware-software system, including a mixed analog-digital approach, should improve the speed and also the quality and resolution of the system. The presented prototypes contain high-precision CMOS integrated filters with low sensitivity to technological mismatches. The proposed laboratory stand makes it possible to obtain accurate current measurements for such circuits.

We expect that developing our existing speech processing systems by adding haptic interfaces can enlarge the area of possible application in medical treatment and diagnosis of speech abnormalities.

ACKNOWLEDGMENT

The research effort is sponsored by the grant of the Polish Ministry of Education and Science No. 3T11B 027 29.

REFERENCES

[1] Z. Ciota: "Speaker Verification for Multimedia Application", IEEE International Conference on Systems, Man and Cybernetics, October 10-13, 2004, Hague, Netherlands, pp. 2752-2756
[2] Progress in speech synthesis, edited by J. Santon et al., Springer, New York 1996
[3] Z. Ciota: "Emotion Recognition on the Basis of Human Speech", ICECom-2005, 18th International Conference on Applied Electromagnetics and Communications, 12-14 October 2005, Dubrovnik, Croatia, pp. 467-470
[4] Charlotte Magnusson, Kirsten Rassmus-Grohn: "Audio haptic tools for navigation in non visual environments", Proc. ENACTIVE05 2nd International Conference on Enactive Interfaces, Genoa, Italy, November 17-18, 2005
[5] G. Nikolakis, I. Tsampoulatidis, D. Tzovaras and Michael G. Strintzis: "Haptic Browser: A Haptic Environment to Access HTML Pages", SPECOM'2004: 9th Conference Speech and Computer, St. Petersburg, Russia, September 20-22, 2004
[6] Chulhee Lee, Donghoon Hyun, Euisun Choi, Jinwook Go, Chungyong Lee: "Optimizing Feature Extraction for Speech Recognition", IEEE Trans. Speech and Audio Processing, no 1, January 2003, pp. 80-87
[7] Z. Ciota: "Improvement of speech processing using fuzzy logic approach", Proc. Joint 9th IFSA World Congress and 20th NAFIPS International Conference, July 25-28, 2001, Vancouver, Canada, pp. 727-731
[8] B.D. Adelstein, D.R. Begault, M.R. Anderson, E.M. Wenzel: "Sensitivity to Haptic-Audio Asynchrony", Proceedings, 5th International Conference on Multimodal Interfaces, ACM, Vancouver, Canada, 2003, pp. 73-76
[9] Rominski A., Ciota Z., Napieralski A.: "Audio Signal Processing Using Mixed Hardware-Software Approach", Proc. Nanotechnology Conference and Trade Show, NSTI Nanotech, Boston, Massachusetts, USA 2006, pp. 63-65
