Distributed Speech Processing in MiPad's Multimodal User Interface



  • 7/25/2019 Distributed Speech Processing in MiPads Multimodal User Interface


Distributed Speech Processing in MiPad's Multimodal User Interface

Presented by: Madhav Krishna

    C0-2

    1120111


    1. Introduction-

    1.1 GUI vs Multimodal Interface-

GUI relies heavily on a graphical display, keyboard, and pointing devices that are not always available.

Mobile computers have constraints on physical size and battery power, or present limitations due to hands-busy, eyes-busy scenarios, which make a traditional GUI a challenge.

Spoken-language-enabled multimodal interfaces are widely believed to be capable of dramatically enhancing the usability of computers because GUI and speech have complementary strengths. While spoken language has the potential to provide a natural interaction model, the difficulty of resolving the ambiguity of spoken language and the high computational requirements of speech technology have so far prevented it from becoming mainstream in a computer's user interface. MiPad (Multimodal Interactive Pad) is a prototype designed to change this.


1.2 More about MiPad (Working)-

MiPad intends to alleviate a prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs by adding speech capability through a built-in microphone.

MiPad is designed to support a variety of tasks such as E-mail, voice-mail, calendar, contact list, notes, web browsing, mobile phone, and document reading and annotation. This collection of functions unifies the various mobile devices into a single, comprehensive communication and productivity tool. While the entire functionality of MiPad can be accessed by pen alone, it was found that a better user experience can be achieved by combining pen and speech inputs. Other pointing devices, such as a roller on the side of the device for navigating among the input fields, can also be employed to enable one-handed operation.

The speech input method, called Tap & Talk, not only indicates where the recognized text should go but also serves as a push-to-talk button. Tap & Talk narrows down the number of possible utterances for the spoken language processing module. For example, selecting the "To:" field on an e-mail application display indicates that the user is about to enter a name. This sharply constrains both the recognizer's search space and the understanding module's interpretation.
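The field-dependent constraint behind Tap & Talk can be sketched as follows. The field names, vocabularies, and score table below are invented for illustration; the real system uses full grammars and a CSR engine rather than a fixed hypothesis list.

```python
# Sketch of Tap & Talk's constraint: tapping a field selects a
# field-specific grammar, shrinking the recognizer's search space.

FIELD_GRAMMARS = {
    "to": ["john smith", "derek"],        # name grammar for the To: field
    "time": ["two o'clock", "four pm"],   # time-expression grammar
    "body": None,                         # None = unrestricted dictation
}

def recognize(tapped_field, hypothesis_scores):
    """Pick the best-scoring hypothesis among those the tapped field allows."""
    allowed = FIELD_GRAMMARS[tapped_field]
    candidates = {h: s for h, s in hypothesis_scores.items()
                  if allowed is None or h in allowed}
    return max(candidates, key=candidates.get)

# Tapping "To:" rules out a time expression even though it scores higher:
scores = {"derek": -5.0, "four pm": -4.0}
print(recognize("to", scores))   # -> derek
```

The point of the sketch is that the tap arrives before the speech, so the constraint can be applied up front rather than as post-filtering.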


One key feature of MiPad is a general-purpose "Command" field to which a user can issue naturally spoken commands such as "Schedule a meeting with Redd tomorrow at two o'clock."

From the user's perspective, MiPad not only recognizes but understands the command, with MiPad executing the necessary actions conveyed in the spoken commands. In response to the above command, MiPad will display a "meeting arrangement" screen with related fields (such as date, time, attendees, etc.) filled appropriately based on the user's utterance. MiPad fully implements Personal Information Management (PIM) functions including email, calendar, notes, task, and contact list. All MiPad applications are configured in a client/server architecture, as shown in Fig. 2:


The client on the left side of Fig. 2 is MiPad, powered by the Microsoft Windows CE operating system, which supports 1) sound capture; 2) front-end acoustic processing including noise reduction, channel normalization, feature compression, and error protection; 3) GUI processing; and 4) a fault-tolerant communication layer that allows the system to recover gracefully from network connection failures.

A wireless local area network (WLAN) connects MiPad to a host machine (server) where the continuous speech recognition (CSR) … of runtime heap, and merely consumes approximately 35% of CPU load with the iPAQ's 206 MHz StrongARM.
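The fault-tolerant communication layer's behavior (buffer, retry, resume) can be illustrated with a toy send loop. The `FlakyLink` transport and every parameter here are invented for illustration; this is not MiPad's actual protocol.

```python
import random

# Toy illustration of fault-tolerant feature transmission: packets stay
# in a buffer until delivered, and sends are retried after (simulated)
# connection drops, so a transient failure loses no data.

class FlakyLink:
    """A transport that randomly drops the connection."""
    def __init__(self, fail_prob, rng):
        self.fail_prob, self.rng, self.delivered = fail_prob, rng, []

    def send(self, packet):
        if self.rng.random() < self.fail_prob:
            raise ConnectionError("link dropped")
        self.delivered.append(packet)

def send_features(frames, link, max_retries=1000):
    """Send each feature packet in order, retrying after failures."""
    pending = list(frames)
    retries = 0
    while pending:
        try:
            link.send(pending[0])
            pending.pop(0)      # treated as acknowledged: leave the buffer
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise           # give up; a real client would reconnect
    return link.delivered

link = FlakyLink(fail_prob=0.3, rng=random.Random(42))
out = send_features([f"frame{i}" for i in range(5)], link)
assert out == [f"frame{i}" for i in range(5)]   # all frames arrive, in order
```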


1.3 Rationale Behind MiPad's Architecture-

Although customized system software and hardware have been reported to bring extra benefits and flexibility in tailoring applications to mobile environments, the MiPad project utilizes only off-the-shelf hardware and software. Given the rapid improvements in hardware and system software capabilities, it is believed such an approach is a reasonable one.

Second, although speaker-independent speech recognition has made significant strides during the past two decades, we have deliberately positioned MiPad as a personal device where the user profile can be utilized to enrich applications and complement technological shortcomings. For speech, this means it may use speaker-dependent recognition, thereby avoiding the challenges faced by other approaches. In addition to enabling higher recognition accuracy, user-specific information can also be stored locally, and speaker-specific processing can be carried out on the client device itself. This architecture allows us to create user-customized applications using generic servers, thereby improving overall scalability.


2. Robustness to Acoustic Environments-

Immunity to noise and channel distortion is one of the most important design considerations for MiPad. With the convenience of using the built-in microphone, noise robustness becomes a key challenge to maintaining desirable speech recognition and understanding performance.

This section presents the most recent results in the framework of distributed speech recognition.


2.1 Basic version of SPLICE-

SPLICE is a frame-based, bias-removal algorithm for cepstrum enhancement under additive noise, channel distortion, or a combination of the two. SPLICE assumes no explicit noise model; the noise characteristics are embedded in the piecewise linear mapping between the "stereo" clean and distorted speech cepstral vectors. The piecewise linearity is intended to approximate the true nonlinear relationship between the two.

SPLICE is potentially able to handle a wide range of distortions, including non-stationary distortion, joint additive and convolutional distortion, and nonlinear distortion (in the time domain), because the stereo data provides accurate estimates of the bias or correction vectors without the need for an explicit noise model.

One key requirement for the success of the basic version of SPLICE described here is that the distortion conditions under which the correction vectors are learned match those encountered at test time.
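The basic SPLICE step admits a compact sketch. Assuming, as in the published SPLICE formulation, a Gaussian-mixture model over noisy cepstra with one correction vector per component, the enhanced frame is x_hat = y + sum_k p(k|y) r_k. All dimensions and parameter values below are random stand-ins for trained ones.

```python
import numpy as np

# Minimal SPLICE sketch. A GMM with K components models the *noisy*
# cepstral space; each component k carries a correction (bias) vector
# r[k] learned from stereo (clean, noisy) pairs: r[k] = E[x - y | k].

rng = np.random.default_rng(0)
K, D = 8, 13                      # mixture components, cepstral dimension
mu = rng.normal(size=(K, D))      # component means (stand-in values)
var = np.ones((K, D))             # diagonal variances
w = np.full(K, 1.0 / K)           # component priors
r = rng.normal(scale=0.1, size=(K, D))  # correction vectors from stereo data

def splice_enhance(y):
    """Return the enhanced (pseudo-clean) cepstral vector for noisy frame y."""
    # log N(y; mu_k, var_k) for each component, diagonal covariance
    log_p = -0.5 * (((y - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(axis=1)
    log_p += np.log(w)
    post = np.exp(log_p - log_p.max())
    post /= post.sum()            # p(k | y)
    # piecewise-linear bias removal: x_hat = y + sum_k p(k|y) * r_k
    return y + post @ r

y = rng.normal(size=D)
x_hat = splice_enhance(y)
```

The mapping is "piecewise linear" because, within the region dominated by one component, the enhancement reduces to adding that component's fixed bias vector.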


2.2 Enhancing SPLICE by Temporal Smoothing-

In this enhanced version of SPLICE, we not only minimize the static deviation from the clean to noisy cepstral vectors (as in the basic version of SPLICE), but also seek to minimize the dynamic deviation.

The basic SPLICE optimally processes each frame of noisy speech independently. An obvious extension is to jointly process a segment of frames. In this way, although the deviation from the clean to noisy speech cepstra for an individual frame could be undesirably greater than that achieved by the basic, static SPLICE, the overall deviation, which takes into account the whole sequence of frames and the mismatch of slopes, will be reduced compared with the basic SPLICE.

2.3 Enhancing SPLICE by Noise Estimation and Noise Normalization-

In this enhancement of SPLICE, different noise conditions between the SPLICE training set and test set are normalized. The research showed that the effectiveness of the above noise normalization…


2.3.1 Non-stationary Noise Estimation by Iterative Stochastic Approximation-

A novel algorithm is proposed, implemented, and evaluated for recursive estimation of parameters in a nonlinear model involving incomplete data. The algorithm is applied specifically to time-varying deterministic parameters of additive noise in a mildly nonlinear model that accounts for the generation of the cepstral data of noisy speech from the cepstral data of the noise and clean speech.
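The flavor of this kind of recursive estimation can be shown with a deliberately simplified one-dimensional stand-in. This is not the paper's algorithm: the model below, the "known" clean-speech mean, and the step size are all assumptions chosen for illustration of a stochastic-approximation update tracking a drifting noise parameter.

```python
import numpy as np

# 1-D toy: a slowly drifting additive-noise level n_t is tracked from
# noisy log-energies generated by the mildly nonlinear model
#   y = x + log(1 + exp(n - x)),
# which is the usual log-domain form of additive noise.

rng = np.random.default_rng(1)
T = 2000
x = rng.normal(loc=0.0, scale=0.5, size=T)   # clean log-energies
n_true = 0.5 + 0.0005 * np.arange(T)         # slowly drifting noise level
y = x + np.log1p(np.exp(n_true - x))         # observed noisy log-energies

x_bar, alpha, n_hat = 0.0, 0.05, 0.0         # assumed clean mean, step size
for t in range(T):
    # predicted observation if the current estimate n_hat were exact
    y_pred = x_bar + np.log1p(np.exp(n_hat - x_bar))
    n_hat += alpha * (y[t] - y_pred)         # stochastic-approximation step
# n_hat now follows the drift of n_true, with noise-induced jitter
```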


3. Feature Compression and Error Protection-

This is intended to address the three key requirements for successful deployment of distributed speech recognition associated with the client/server approach: 1) compression of cepstral features (via quantization) must not degrade speech recognition performance; 2) the algorithm for source and channel coding must be robust to packet losses, bursty or otherwise; and 3) the total time delay due to the coding, which results from a combined quantization delay, error-correction coding delay, and transmission delay, must be kept within an acceptable level.

3.1 Feature Compression-

A new source coding algorithm has been developed that consists of two sequential stages. After the standard Mel-cepstra are extracted, each speech frame is first classified into a phonetic category (e.g., phoneme) and then vector quantized (VQ) using the split-VQ approach. The motivation behind this new source coder is that the speech signal can be composed of distinct phonetic classes, each of which is better served by its own codebook.
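The split-VQ stage can be sketched as follows. The sub-vector layout and the random stand-in codebooks are assumptions (trained codebooks would be used in practice), and the phonetic classification stage of the real coder is omitted here.

```python
import numpy as np

# Minimal split-VQ sketch: a 13-dim Mel-cepstral frame is split into
# sub-vectors, each quantized against its own codebook, so only a few
# indices per frame need to cross the network.

rng = np.random.default_rng(2)
splits = [(0, 4), (4, 8), (8, 13)]              # assumed sub-vector layout
codebooks = [rng.normal(size=(16, hi - lo))     # 16 codewords = 4 bits each
             for lo, hi in splits]

def split_vq_encode(frame):
    """Return one nearest-codeword index per sub-vector."""
    return [int(np.argmin(np.linalg.norm(cb - frame[lo:hi], axis=1)))
            for (lo, hi), cb in zip(splits, codebooks)]

def split_vq_decode(indices):
    """Reconstruct an approximate frame from the transmitted indices."""
    return np.concatenate([cb[i] for i, cb in zip(indices, codebooks)])

frame = rng.normal(size=13)
codes = split_vq_encode(frame)    # 3 indices, i.e. 12 bits per frame here
recon = split_vq_decode(codes)
```

Splitting keeps each codebook small: three 4-bit codebooks cover the frame with 3 × 16 codewords, whereas a single 12-bit full-vector codebook would need 4096.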


3.2 Error Protection-

A novel channel coder has also been developed to protect MiPad's Mel-cepstral features, based on the client/server architecture. The channel coder assigns unequal amounts of redundancy among the different source bits, giving a greater amount of protection to the most important bits, where the importance is measured by the contributions of these bits to the word error rate in speech recognition.

A quantifiable procedure to assess the importance of each bit is developed, and the channel coder exploits this utility function for the optimal forward error correction (FEC) assignment.
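The unequal-error-protection idea can be illustrated with simple repetition codes. The importance weights and repetition counts below are invented stand-ins for the paper's WER-based utility function, and the real system uses an optimized FEC assignment rather than repetition.

```python
import random

# Unequal error protection (UEP) sketch: more redundancy goes to the
# bits whose corruption hurts recognition most.

importance = [5, 3, 1]                  # e.g. MSB of a VQ index matters most
reps = [2 * w + 1 for w in importance]  # odd repeat counts: 11, 7, 3

def encode(bits):
    """Repeat each source bit according to its protection level."""
    return [b for b, r in zip(bits, reps) for _ in range(r)]

def corrupt(stream, p_flip, rng):
    """Binary symmetric channel: flip each bit with probability p_flip."""
    return [b ^ (rng.random() < p_flip) for b in stream]

def decode(stream):
    """Majority-vote each repetition group back to one source bit."""
    out, pos = [], 0
    for r in reps:
        out.append(int(sum(stream[pos:pos + r]) > r // 2))
        pos += r
    return out

rng = random.Random(0)
recv = decode(corrupt(encode([1, 0, 1]), p_flip=0.2, rng=rng))
# the heavily protected first bit survives channel flips far more often
# than the lightly protected last one
```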


4. Continuous Speech Recognition and Understanding-

While the compressed and error-protected Mel-cepstral features are computed on the MiPad client, the major computation for continuous speech recognition (decoding) resides in the server. The entire set of the language model, hidden Markov models (HMMs), and lexicon used for speech decoding resides in the server, which processes the Mel-cepstral features transmitted from the client.

MiPad is designed to be a personal device. As a result, speech recognition uses speaker-adaptive acoustic models (HMMs) and a user-adapted lexicon to improve recognition accuracy.

The speech recognition engine in MiPad uses a unified language model that takes advantage of both rule-based and data-driven approaches. Consider two training sentences: "Meeting at three with John Smith" versus "Meeting at four PM with Derek". Within a pure n-gram framework, we need to estimate P(John | three, with) and P(Derek | PM, with) individually. This makes it very difficult to capture the obviously needed long-span dependencies.


For the example listed here, we may have CFGs for ⟨NAME⟩ and ⟨TIME⟩, respectively, which can be derived from factoid grammars of smaller sizes. The training sentences now look like: "Meeting at [three:TIME] with [John Smith:NAME]" and "Meeting at [four PM:TIME] with [Derek:NAME]".

With parsed training data, we can now estimate the n-gram probabilities as usual. For example, the replacement of P(John | three, with) by P(⟨NAME⟩ | ⟨TIME⟩, with) makes such an "n-gram" representation more meaningful and more accurate. Inside each CFG, however, we can still derive P(John Smith | ⟨NAME⟩) and P(four PM | ⟨TIME⟩) from the existing n-gram (n-gram probability inheritance) so that they are appropriately normalized.
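The probability-pooling effect of the class tags can be sketched with a tiny counter. The tagging format and two-sentence corpus are illustrative, and only the counting side is shown; the unified model additionally inherits within-class probabilities from the CFG/n-gram as described above.

```python
from collections import Counter

# Class-based counting sketch: tagged spans collapse to class tokens for
# the word-level n-gram, while per-class counts normalize P(member | class).

corpus = [
    [("meeting", None), ("at", None), ("three", "TIME"),
     ("with", None), ("john smith", "NAME")],
    [("meeting", None), ("at", None), ("four pm", "TIME"),
     ("with", None), ("derek", "NAME")],
]

bigrams, class_members = Counter(), Counter()
for sent in corpus:
    toks = [f"<{cls}>" if cls else w for w, cls in sent]
    bigrams.update(zip(["<s>"] + toks, toks + ["</s>"]))
    class_members.update((cls, w) for w, cls in sent if cls)

# Both sentences now share the event P(<NAME> | with): the count is pooled.
print(bigrams[("with", "<NAME>")])          # -> 2
# Within-class probability, normalized over the NAME members seen:
name_total = sum(c for (cls, _), c in class_members.items() if cls == "NAME")
print(class_members[("NAME", "derek")] / name_total)   # -> 0.5
```

Without the tags, "John" after "three with" and "Derek" after "PM with" would each be seen once and learned separately; with them, a single well-estimated class event covers both.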


5. MiPad User Interface Design and Evaluation-

As mentioned previously, MiPad does not employ speech synthesis as an output method. This design decision is motivated mainly by the following two reasons. First, despite the significant progress in synthesis technologies, especially in the area of concatenated waveforms, the quality of synthesized speech has remained unsatisfactory for large-scale deployments. The most critical drawback of speech output is the non-persistent, or volatile, nature of speech presentation. The human user must process the speech message and memorize its contents in real time. There is no known user interface design that can elegantly assist the human user in cases where the speech waveform cannot be easily heard and understood, or where there is simply too much information to be absorbed. In contrast, a graphical display can render a large amount of information persistently for leisurely consumption, avoiding the aforementioned problems.


MiPad takes advantage of the graphical display in UI design. The graphical display dramatically simplifies dialog management. For instance, MiPad is able to considerably streamline the confirmation and error-repair strategy, as all the inferred user intentions are confirmed implicitly on the screen. Whenever an error occurs, the user can correct it through whichever of the GUI or speech modalities is appropriate and appears more natural. Thanks to the display's persistency, users are not obligated to correct errors immediately after they occur. The display also allows MiPad to confirm and ask the user many questions in a single turn.

5.1 Tap & Talk Interface-

Because of MiPad's small form factor, the present pen-based methods for getting text into a PDA are potential barriers to broad market acceptance. Speech is generally not as precise as a mouse or a pen for performing position-related operations. Speech interaction can also be adversely affected by ambient noise.


Despite these disadvantages, speech communication is not only natural but also provides a powerful complementary modality to enhance the pen-based interface, if the strengths of using speech can be appropriately leveraged and the technology limitations overcome.

In Table II, we elaborate several cases which show that pen and speech can be complementary and used effectively for handheld devices. The advantage of pen is typically the weakness of speech, and vice versa.


5.2 Visual Feedback for Speech Inputs-

Processing latency is a well-recognized issue in user interface design. This is even more so for MiPad, in which distributed speech recognition is employed. In addition to the recognition process itself, the wireless network introduces further latency that is sometimes not easily controllable. Conventional wisdom in UI design dictates that filling the time with visual feedback not only significantly improves usability, but also prevents users from adversely intervening in an ongoing process that cannot be easily recovered. For these reasons, MiPad adopts visual feedback for speech inputs.

5.3 User Study Results-

The goal is to make MiPad produce real value for users. It is necessary to have a rigorous evaluation to measure the usability of the prototype. Our major concerns are: "Is the task completion time much better?" and "Is it easier to get the job done?"


2) "Is it easier to get the job done?": Fifteen out of the 16 participants in the evaluation stated that they preferred using the Tap & Talk interface for creating new appointments, and all 16 said they preferred it for writing longer emails.


6. References-