Tutorial
Developing and Deploying Multimodal Applications
James A. Larson, Larson Technical Services
jim @ larson-tech.com
SpeechTEK West, February 23, 2007
James A. Larson Developing & Delivering Multimodal Applications 2
Developing and Deploying Multimodal Applications
What applications should be multimodal?
What is the multimodal application development process?
What standard languages can be used to develop multimodal applications?
What standard platforms are available for multimodal applications?
Capturing Input from the User

Medium     Input Device           Mode
Acoustic   Microphone             Speech
Tactile    Keypad, Keyboard       Key
Tactile    Pen                    Ink
Tactile    Mouse, Joystick        GUI
Visual     Scanner, Still camera  Photograph
Visual     Video camera           Movie
Capturing Input from the User (Multimodal)

Medium      Input Device           Mode
Acoustic    Microphone             Speech
Tactile     Keypad, Keyboard       Key
Tactile     Pen                    Ink
Tactile     Mouse, Joystick        GUI
Visual      Scanner, Still camera  Photograph
Visual      Video camera           Movie, Gaze tracking, Gesture recognition, Biometric
Electronic  RFID, GPS              Digital data
Presenting Output to the User

Medium    Output Device  Mode
Acoustic  Speaker        Speech
Visual    Display        Text, Photograph, Movie
Tactile   Joystick       Pressure

Multimedia: combinations of the above modes
Multimodal and Multimedia Application Benefits
Provide a natural user interface by using multiple channels for user interactions
Simplify interaction with small devices with limited keyboard and display, especially on portable devices
Leverage advantages of different modes in different contexts
Decrease error rates and time required to perform tasks
Increase accessibility of applications for special users
Enable new kinds of applications
Exercise 1
What new multimodal applications would be useful for your work?
What new multimodal applications would be entertaining to you, your family, or friends?
Voice as a “Third Hand”
Game Commander 3
• http://www.gamecommander.com/
Voice-Enabled Games
Scansoft’s VoCon Games Speech SDK
• http://www.scansoft.com/games/
• PlayStation® 2
• Nintendo® GameCube™
• http://www.omnipage.com/games/poweredby/
Education
Tucker Maxon School of Oral Education
http://www.tmos.org/
Education
Reading Tutor Project
http://cslr.colorado.edu/beginweb/reading/reading.html
Multimodal Applications Developed by PSU and OHSU Students
Hands-busy
• Troubleshooting a car's motor
• Repairing a leaky faucet
• Tuning musical instruments
Construction
• Complex origami artifact
• Project book for children
Cooking
• Talking recipe book
Entertainment
• Child's fairy tale book
• Audio-controlled juke box
• Games (Battleship, Go)
Multimodal Applications Developed by PSU and OHSU Students (continued)
Data collection
• Buy a car
• Collect health data
• Buy movie tickets
• Order meals from a restaurant
• Conduct banking business
• Locate a business
• Order a computer
• Choose homeless pets from an animal shelter
Authoring
• Photo album tour
Education
• Flash cards: addition tables

Download Opera and the speech plug-in.
Go to www.larson-tech.com/mm-Projects/Demos.htm
New Application Classes
Active listening
• Verbal VCR controls: start, stop, fast forward, rewind, etc.
Virtual assistants
• Listen for requests and immediately perform them
• Violin tuner
• TV controller
• Environmental controller
• Family-activity coordinator
Synthetic experiences
• Synthetic interviews
• Speech-enabled games
• Education and training
Authoring content
Two General Uses of Multiple Modes of Input
Redundancy—One mode acts as backup for another mode
In noisy environments, use keypad instead of speech input.
In cold environments, use speech instead of keypad.
Complementary—One mode supplements another mode
Voice as a third hand
“Move that (point) to there (point)” (late fusion)
Lip reading = video + speech (early fusion)
Potential Problems with Multimodal Applications
Voice may make an application "noisy."
• Privacy and security concerns
• Noise pollution
Sometimes speech and handwriting recognition systems fail.
Users may falsely expect full natural language understanding, which is possible only on Star Trek (and is often incorrectly called "NLP").
Full "natural language" processing requires:
• Knowledge of the outside world
• A history of the user-computer interaction
• A sophisticated understanding of language structure
"Natural language-like" systems simulate natural language for a small domain, a short history, and specialized language structures.
Adding a New Mode to an Application
Only if…
The new mode enables new features not previously possible.
The new mode dramatically improves usability.
Always….
Redesign the application to take advantage of the new mode.
Provide backup for the new mode.
Test, test, and test some more.
Exercise 2
Where will multimodal applications be used?
A. At home
B. At work
C. “On the road”
D. Other?
Developing and Deploying Multimodal Applications
What applications should be multimodal?
What is the multimodal application development process?
What standard languages can be used to develop multimodal applications?
What standard platforms are available for multimodal applications?
The Playbill—Who’s Who on the Team
Users—Their lives will be improved by using the multimodal application
Interaction designer—Designs the dialog—when and how the user and system interchange requests and information
Multimodal programmer—Implements the voice and multimodal user interface
Voice talent—Records spoken prompts and messages
Grammar writer—Specifies words and phrases the user may speak in response to a prompt
TTS specialist—Specifies verbal and audio sounds and inflections
Quality assurance specialist—Performs tests to validate the application is both useful and usable
Customer—Pays the bills
Program manager—Organizes the work and makes sure it is completed according to schedule and under budget
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Each stage involves users
Iterative refinement
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Identify the application:
• Conduct ethnography studies
• Identify candidate applications
• Conduct focus groups
• Select the application
Exercise 3
What will be the “killer” consumer multimodal applications?
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Specify the application:
• Construct the conceptual model
• Construct scenarios
• Specify performance and preference requirements
Specify Performance and Preference Requirements
Performance: Is the application useful?
• Measure what the users actually accomplished.
• Validate that the users achieved success.
Preference: Is the application enjoyable?
• Measure users' likes and dislikes.
• Validate that the users enjoyed the application and will use it again.
Performance Metrics
User Task                  Measure                                           Typical Criteria
Speak a command            Word error rate                                   Less than 3%
Supply values into a form  Valid values entered into each field of the form  Less than 5 seconds per value
Navigate a list            The user selects the specified option             Greater than 95%
Purchase a product         The user completes the purchase                   Greater than 93%
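The word error rate criterion above is conventionally computed as word-level edit distance divided by the number of reference words. This is an illustrative sketch, not part of the tutorial:

```javascript
// Word error rate = (substitutions + insertions + deletions) / reference word count,
// computed with a word-level Levenshtein (edit distance) table.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // dp[i][j] = edit distance between the first i reference words
  // and the first j hypothesis words
  const dp = [];
  for (let i = 0; i <= ref.length; i++) {
    dp.push(new Array(hyp.length + 1).fill(0));
    dp[i][0] = i;
  }
  for (let j = 0; j <= hyp.length; j++) dp[0][j] = j;
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const subst = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,         // deletion
        dp[i][j - 1] + 1,         // insertion
        dp[i - 1][j - 1] + subst  // match or substitution
      );
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}
```

For example, one substituted word in a three-word reference gives a word error rate of one third.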
Exercise 4
User Task Measure Typical Criteria
Specify performance metrics for the multimodal email application
Preference Metrics
Question Typical Criteria
On a scale from 1 to 10, rate the help facility.
The average caller score is greater than 8.
On a scale from 1 to 10, rate the ease of use of this application.
The average caller score is greater than 8.
Would you recommend using this voice portal to a friend?
Over 80% of callers respond by saying “yes.”
What would you be willing to pay each time you use this application?
Over 80% of callers indicate that they are willing to pay $1.00 or more per use.
Exercise 5
Question Typical Criteria
Specify preference metrics for the multimodal email application
Preference Metrics (Open-ended Questions)
What did you like the best about this voice-enabled application? (Do not change these features.)
What did you like the least about this voice-enabled application? (Consider changing these features.)
What new features would you like to have added? (Consider adding these features in this or a later release.)
What features do you think you will never use? (Consider deleting these features.)
Do you have any other comments and suggestions? (Pay attention to these responses. Callers frequently suggest very useful ideas.)
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Develop the application:
• Specify the persona
• Specify the modes and modalities
• Specify the dialog script
UI Design Guidelines
Guidelines for Voice User Interfaces
• Bruce Balentine and David P. Morgan. How to Build a Speech Recognition Application, Second Edition. http://www.eiginc.com
Guidelines for Graphical User Interfaces
• Research-Based Web Design and Usability Guidelines. U.S. Department of Health and Human Services. http://www.usability.gov/pdfs/guidelines.html
Guidelines for Multimodal User Interfaces
• Common Sense Guidelines for Developing Multimodal User Interfaces. W3C Working Group Note, 19 April 2006. http://www.w3.org/2002/mmi/Group/2006/Guidelines/
Common-sense Suggestions
1. Satisfy Real-World Constraints
Task-oriented Guidelines
1.1. Guideline: For each task, use the easiest mode available on the device.
Physical Guidelines
1.2. Guideline: If the user’s hands are busy, then use speech.
1.3. Guideline: If the user’s eyes are busy, then use speech.
1.4. Guideline: If the user may be walking, use speech for input.
Environmental Guidelines
1.5. Guideline: If the user may be in a noisy environment, then use a pen, keys or mouse.
1.6. Guideline: If the user’s manual dexterity may be impaired, then use speech.
Exercise 6
What input mode(s) should be used for each of the following tasks?
A. Selecting objects
B. Entering text
C. Entering symbols
D. Entering sketches or illustrations
Common-sense Suggestions
2. Communicate Clearly, Concisely, and Consistently with Users
Consistency Guidelines
2.1. Phrase all prompts consistently.
2.2. Enable the user to speak keyword utterances rather than natural language sentences.
2.3. Switch presentation modes only when the information is not easily presented in the current mode.
2.4. Make commands consistent.
2.5. Make the focus consistent across modes.
Organizational Guidelines
2.6. Use audio to indicate the verbal structure.
2.7. Use pauses to divide information into natural “chunks.”
2.8. Use animation and sound to show transitions.
2.9. Use voice navigation to reduce the number of screens.
2.10. Synchronize multiple modalities appropriately.
2.11. Keep the user interface as simple as possible.
Common-sense Suggestions
3. Help Users Recover Quickly and Efficiently from Errors
Conversational Guidelines
3.1. Users tend to use the same mode that was used to prompt them.
3.2. If privacy is not a concern, use speech as output to provide commentary or help.
3.3. Use directed user interfaces, unless the user is always knowledgeable and experienced in the domain.
3.4 Always provide context-sensitive help for every field and command.
Common-sense Suggestions
3. Help Users Recover Quickly and Efficiently from Errors (Continued)
Reliability Guidelines
Operational status
3.5. The user should always be able to determine easily whether the device is listening.
3.6. For devices with batteries, users should always be able to determine easily how much longer the device will be operational.
3.7. Support at least two input modes so one input mode can be used when the other cannot.
Visual feedback
3.8. Present words recognized by the speech recognition system on the display, so the user can verify they are correct.
3.9. Display the n-best list to enable easy speech recognition error correction
3.10. Try to keep response times less than 5 seconds. Inform the user of longer response times.
Common-sense Suggestions
4. Make Users Comfortable
Listening mode
4.1. Speak after pressing a speak key, which automatically releases after the user finishes speaking.
System status
4.2. Always present the current system status to the user.
Human-memory Constraints
4.3. Use the screen to ease stress on the user’s short-term memory.
Common-sense Suggestions
4. Make Users Comfortable (Continued)
Social Guidelines
4.4. If the user may need privacy, use a display rather than rendered speech.
4.5. If the user may need privacy, use a pen or keys.
4.6. If the device may be used during a business meeting, then use a pen or keys (with the keyboard sounds turned off).
Advertising Guidelines
4.7. Use animation and sound to attract the user's attention.
4.8. Use landmarks to help the user know where he or she is.
Common-sense Suggestions
4. Make Users Comfortable (Continued)
Ambience
4.9. Use audio and graphic design to set the mood and convey emotion in games and entertainment applications.
Accessibility
4.10 For each traditional output technique, provide an alternative output technique.
4.11. Enable users to adjust the output presentation.
Books
Ramon Lopez-Cozar Delgado and Masahiro Araki. Spoken, Multilingual and Multimodal Dialog Systems—Development and Assessment. West Sussex, England: Wiley, 2005.
Julie A. Jacko and Andrew Sears (Editors). The Human-Computer Interaction Handbook—Fundamentals, Evolving Technologies, and Emerging Applications. Mahwah, New Jersey: Lawrence Erlbaum Associates, 2003.
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Test the application:
• Component test
• Usability test
• Stress test
• Field test
Testing Resources
Jeffrey Rubin. Handbook of Usability Testing. New York: Wiley Technical Communication Library, 1994.
Peter and David Leppik. Gourmet Customer Service. Eden Prairie, MN: VocalLabs, 2005. [email protected]
Development Process
Investigation Stage
Design Stage
Development Stage
Testing Stage
Sustaining Stage
Deploy and monitor the application:
• User surveys
• Usage reports from log files
• User feedback and comments
Developing and Deploying Multimodal Applications
What applications should be multimodal?
What is the multimodal application development process?
What standard languages can be used to develop multimodal applications?
What standard platforms are available for multimodal applications?
W3C Multimodal Interaction Framework
Recognition Grammar
Semantic Interpretation
Extensible MultiModal Annotation (EMMA)
Speech Synthesis
Interaction Managers
General description of speech application components and how they relate
W3C Multimodal Interaction Framework
[Diagram: user input and output flow through an Interaction Manager, which connects to Application Functions and Telephony Properties]
W3C Multimodal Interaction Framework
[Diagram: user input from audio, ink, and display devices passes through ASR with semantic interpretation to Information Integration and the Interaction Manager; output flows back through Language Generation, Media Planning, TTS, and audio/telephony functions, with Application Functions behind the Interaction Manager]
W3C Multimodal Interaction Framework
[Diagram: the framework, highlighting the recognition grammar]
SRGS: Describes what the user may say at each point in the dialog
Speech Recognition Engines
                      Low-end              High-end             Other
Speaking mode         Isolated (discrete)  Continuous           Keywords
Enrollment            Speaker-dependent    Speaker-independent  Adaptive
Vocabulary size       Small                Large                Switch vocabularies
Speaking style        Read                 Spontaneous
Simultaneous callers  Single-threaded      Multi-threaded
Grammars
Describe what the user may say or handwrite at a point in the dialog
Enable the recognition engine to work faster and more accurately
Two types of grammars:
– Structured grammars
– Statistical grammars (N-grams)
Structured Grammars
Specifies words that a user may speak or write
Two representation formats
1. Augmented Backus-Naur Form (ABNF) production rules:
   Single_digit ::= zero | one | two | … | nine
   Zero_thru_ten ::= Single_digit | ten
2. XML format, which can be processed by an XML validator
Example XML Grammar
<grammar mode="voice" type="application/srgs+xml" root="zero_to_ten">
  <rule id="zero_to_ten">
    <one-of>
      <ruleref uri="#single_digit"/>
      <item> ten </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> zero </item>
      <item> one </item>
      <item> two </item>
      <item> three </item>
      <item> four </item>
      <item> five </item>
      <item> six </item>
      <item> seven </item>
      <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
</grammar>
Exercise 7
Write a grammar that recognizes the digits zero through nineteen
(Hint: Modify the previous page)
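One possible solution, sketched by extending the digit grammar from the previous slide (the rule names are arbitrary):

```xml
<grammar mode="voice" type="application/srgs+xml" root="zero_to_nineteen">
  <rule id="zero_to_nineteen">
    <one-of>
      <ruleref uri="#single_digit"/>
      <ruleref uri="#teens"/>
    </one-of>
  </rule>
  <rule id="teens">
    <one-of>
      <item> ten </item> <item> eleven </item> <item> twelve </item>
      <item> thirteen </item> <item> fourteen </item> <item> fifteen </item>
      <item> sixteen </item> <item> seventeen </item> <item> eighteen </item>
      <item> nineteen </item>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of>
      <item> zero </item> <item> one </item> <item> two </item>
      <item> three </item> <item> four </item> <item> five </item>
      <item> six </item> <item> seven </item> <item> eight </item>
      <item> nine </item>
    </one-of>
  </rule>
</grammar>
```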
Reusing Existing Grammars
<grammar type="application/srgs+xml" root="size"
         src="http://www.example.com/size.grxml"/>
Exercise 8
Write a grammar for positive responses to a yes/no question (i.e., “yes,” “sure,” “affirmative,” and so forth)
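One possible solution; the exact word list is a judgment call:

```xml
<grammar mode="voice" type="application/srgs+xml" root="yes">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
      <item> yeah </item>
      <item> okay </item>
      <item> yes please </item>
    </one-of>
  </rule>
</grammar>
```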
When Is a Grammar Too Large?
[Chart: word coverage vs. response as grammar size grows]
W3C Multimodal Interaction Framework
[Diagram: the framework, highlighting the semantic interpretation component]
SISR: A procedural JavaScript-like language for interpreting the text strings returned by the speech recognition engine
Semantic Interpretation
Semantic scripts employ ECMAScript
Advantages:
– Translate aliases to vocabulary words
– Perform calculations
– Produce a rich structure rather than a text string
Semantic Interpretation
[Diagram: a recognizer, constrained by a grammar, passes recognized text ("Big white t-shirt" / "Large white t-shirt") to the conversation manager]
Semantic Interpretation
[Diagram: recognizer output passes through a semantic interpretation processor, driven by a grammar with semantic interpretation scripts, to the conversation manager]

<rule id="action">
  <one-of>
    <item> small <tag> out.size = "small"; </tag> </item>
    <item> medium <tag> out.size = "medium"; </tag> </item>
    <item> large <tag> out.size = "large"; </tag> </item>
    <item> big <tag> out.size = "large"; </tag> </item>
  </one-of>
  <one-of>
    <item> green <tag> out.color = "green"; </tag> </item>
    <item> blue <tag> out.color = "blue"; </tag> </item>
    <item> white <tag> out.color = "white"; </tag> </item>
  </one-of>
</rule>

Input: "Big white t-shirt"
Result: { size: "large", color: "white" }
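The effect of those tag scripts can be simulated in ordinary ECMAScript. This is a hypothetical sketch of what the semantic interpretation processor computes, not a real SISR engine; the interpret function and its lookup tables are invented for illustration:

```javascript
// Hypothetical simulation of the t-shirt grammar's <tag> scripts:
// each recognized word runs its script against a shared "out" object.
const sizeWords = { small: "small", medium: "medium", large: "large", big: "large" };
const colorWords = { green: "green", blue: "blue", white: "white" };

function interpret(utterance) {
  const out = {};
  for (const word of utterance.toLowerCase().split(/\s+/)) {
    if (word in sizeWords) out.size = sizeWords[word];    // e.g. out.size = "large";
    if (word in colorWords) out.color = colorWords[word]; // e.g. out.color = "white";
  }
  return out; // a structured result rather than a text string
}
```

Note how the alias "big" is translated to the vocabulary word "large", one of the advantages listed above.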
Exercise 9
Modify this rule to return only "yes"

<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
      …
    </one-of>
  </rule>
</grammar>
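One possible solution: attach a tag after the one-of so that whichever item matched, the result is overwritten with the literal "yes" (a sketch; exact tag syntax varies by semantic interpretation processor):

```xml
<grammar type="application/srgs+xml" root="yes" mode="voice">
  <rule id="yes">
    <one-of>
      <item> yes </item>
      <item> sure </item>
      <item> affirmative </item>
    </one-of>
    <tag> out = "yes"; </tag>
  </rule>
</grammar>
```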
W3C Multimodal Interaction Framework
[Diagram: the framework, highlighting the information integration component]
EMMA: A language for representing the semantic content from speech recognizers, handwriting recognizers, and other input devices
EMMA
Extensible MultiModal Annotation markup language
Canonical structure for semantic interpretations of a variety of inputs, including:
• Speech
• Natural language text
• GUI
• Ink
EMMA
[Diagram: speech and keyboard input are separately recognized and interpreted, guided by grammars with semantic interpretation instructions; each produces an EMMA document, and the documents are merged/unified into a single EMMA document for applications]
EMMA
[Diagram: the same EMMA pipeline; the speech recognizer produces:]

<interpretation mode="speech">
  <travel>
    <to hook="ink"/>
    <from hook="ink"/>
    <day> Tuesday </day>
  </travel>
</interpretation>
EMMA
[Diagram: the same EMMA pipeline with both interpretations:]

<interpretation mode="speech">
  <travel>
    <to hook="ink"/>
    <from hook="ink"/>
    <day> Tuesday </day>
  </travel>
</interpretation>

<interpretation mode="ink">
  <travel>
    <to> Las Vegas </to>
    <from> Portland </from>
  </travel>
</interpretation>
EMMA

<interpretation mode="speech">
  <travel>
    <to hook="ink"/>
    <from hook="ink"/>
    <day> Tuesday </day>
  </travel>
</interpretation>

<interpretation mode="ink">
  <travel>
    <to> Las Vegas </to>
    <from> Portland </from>
  </travel>
</interpretation>

[Diagram: merging/unification combines the two EMMA documents into:]

<interpretation mode="interp1">
  <travel>
    <to> Las Vegas </to>
    <from> Portland </from>
    <day> Tuesday </day>
  </travel>
</interpretation>
Exercise 10
Given the following two EMMA specifications, what is the unified EMMA specification?

<interpretation mode="speech">
  <moneyTransfer>
    <sourceAcct hook="ink"/>
    <targetAcct hook="ink"/>
    <amount> 300 </amount>
  </moneyTransfer>
</interpretation>

<interpretation mode="ink">
  <moneyTransfer>
    <sourceAcct> savings </sourceAcct>
    <targetAcct> checking </targetAcct>
  </moneyTransfer>
</interpretation>

Unified EMMA specification:

<interpretation mode="intp1">
  <moneyTransfer>
    <sourceAcct> ______ </sourceAcct>
    <targetAcct> ______ </targetAcct>
    <amount> ______ </amount>
  </moneyTransfer>
</interpretation>
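For reference, filling each hook from the ink interpretation gives one reasonable answer:

```xml
<interpretation mode="intp1">
  <moneyTransfer>
    <sourceAcct> savings </sourceAcct>
    <targetAcct> checking </targetAcct>
    <amount> 300 </amount>
  </moneyTransfer>
</interpretation>
```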
W3C Multimodal Interaction Framework
[Diagram: the framework, highlighting the TTS component]
SSML: A language for rendering text as synthesized speech
Speech Synthesis Markup Language
Structure analysis
• Markup support: paragraph, sentence
• Non-markup behavior: infer structure by automated text analysis
Text normalization
• Markup support: say-as for dates, times, etc.
• Non-markup behavior: automatically identify and convert constructs
Text-to-phoneme conversion
• Markup support: phoneme, say-as
• Non-markup behavior: look up in pronunciation dictionary
Prosody analysis
• Markup support: emphasis, break, prosody
• Non-markup behavior: automatically generate prosody through analysis of document structure and sentence syntax
Waveform production
Speech Synthesis Markup Language
Examples

<phoneme alphabet="ipa" ph="wɪnɛfɛks"> WinFX </phoneme> is a great platform

<prosody pitch="x-low"> Who’s been sleeping in my bed? </prosody> said papa bear.
<prosody pitch="medium"> Who’s been sleeping in my bed? </prosody> said momma bear.
<prosody pitch="x-high"> Who’s been sleeping in my bed? </prosody> said baby bear.
Popular Strategy
Develop dialogs using SSML
Usability test dialogs
Extract prompts
Hire voice talent to record prompts
Replace <prompt> with <audio>
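The last step might look like this in VoiceXML (a hypothetical fragment; the recording file name is an assumption, and the inline text remains as a fallback if the audio file cannot be fetched):

```xml
<!-- Before: the prompt is rendered by TTS -->
<prompt> Say a city name. </prompt>

<!-- After: play the voice talent's recording instead -->
<prompt>
  <audio src="say-a-city-name.wav"> Say a city name. </audio>
</prompt>
```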
W3C Multimodal Interaction Framework
[Diagram: the framework, highlighting the interaction manager]
VoiceXML: A language for controlling the exchange of information and commands between the user and the system
Developing and Deploying Multimodal Applications
What applications should be multimodal?
What is the multimodal application development process?
What standard languages can be used to develop multimodal applications?
What standard platforms are available for multimodal applications?
Speech APIs and SDKs
• JSAPI—Java Speech Application Program Interface
  – http://java.sun.com/products/java-media/speech/
  – http://developer.mozilla.org/en/docs/JSAPI_Reference
• Nuance Mobile Speech Platform
  – http://www.nuance.com/speechplatform/components.asp
• VSAPI—Voice Signal API
  – http://www.voicesignal.com/news/articles/2006-06-21-SymbianOne.htm
• SALT
  – http://www.saltforum.org/
Interaction Manager Approaches
Three approaches:
• X+V: Interaction Manager in XHTML, with VoiceXML 2.0 modules
• Object-oriented: Interaction Manager in C#, with SAPI 5.3
• W3C: Interaction Manager in SCXML, with XHTML, VoiceXML 3.0, and InkML
SAPI 5.3 & Windows Vista™
Speech Synthesis

W3C Speech Synthesis Markup Language 1.0:

<speak>
  <phoneme alphabet="ipa" ph="wɪnɛfɛks"> WinFX </phoneme>
  is a great platform
</speak>

Microsoft proprietary PromptBuilder:

myPrompt.AppendTextWithPronunciation("WinFX", "wɪnɛfɛks");
myPrompt.AppendText("is a great platform.");
SAPI 5.3 & Windows Vista™
Speech Recognition

W3C Speech Recognition Grammar Specification 1.0:

<grammar type="application/srgs+xml" root="city" mode="voice">
  <rule id="city">
    <one-of>
      <item> New York City </item>
      <item> New York </item>
      <item> Boston </item>
    </one-of>
  </rule>
</grammar>

Microsoft proprietary GrammarBuilder:

Choices cityChoices = new Choices();
cityChoices.AddPhrase("New York City");
cityChoices.AddPhrase("New York");
cityChoices.AddPhrase("Boston");
Grammar cityGrammar =
    new Grammar(new GrammarBuilder(cityChoices));
SAPI 5.3 & Windows Vista™
Semantic Interpretation

Augment the SRGS grammar with JScript® for semantic interpretation:

<grammar type="application/srgs+xml" root="city" mode="voice">
  <rule id="city">
    <one-of>
      <item> New York City <tag> city = "JFK" </tag> </item>
      <item> New York <tag> city = "JFK" </tag> </item>
      <item> Portland <tag> city = "PDX" </tag> </item>
    </one-of>
  </rule>
</grammar>

User-specified "shortcuts": the recognizer replaces a shortcut word with an expanded string.
User says: my address
System enters: 1033 Smith Street, Apt. 7C, Bloggsville 00000
SAPI 5.3 & Windows Vista™
Dialog

1. Introduce the System.Speech.Recognition namespace
2. Instantiate a SpeechRecognizer object
3. Build a grammar
4. Attach an event handler
5. Load the grammar into the recognizer
6. When the recognizer hears something that fits the grammar, the SpeechRecognized event handler is invoked, which accesses the Result object and works with the recognized text
SAPI 5.3 & Windows Vista™ Dialog

using System;
using System.Windows.Forms;
using System.ComponentModel;
using System.Collections.Generic;
using System.Speech.Recognition;

namespace Reco_Sample_1
{
    public partial class Form1 : Form
    {
        // Create a recognizer
        SpeechRecognizer _recognizer = new SpeechRecognizer();

        public Form1() { InitializeComponent(); }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Create a pizza grammar
            Choices pizzaChoices = new Choices();
            pizzaChoices.AddPhrase("I'd like a cheese pizza");
            pizzaChoices.AddPhrase("I'd like a pepperoni pizza");
            pizzaChoices.AddPhrase("I'd like a large pepperoni pizza");
            pizzaChoices.AddPhrase(
                "I'd like a small thin crust vegetarian pizza");
            Grammar pizzaGrammar =
                new Grammar(new GrammarBuilder(pizzaChoices));

            // Attach an event handler
            pizzaGrammar.SpeechRecognized +=
                new EventHandler<RecognitionEventArgs>(
                    PizzaGrammar_SpeechRecognized);

            _recognizer.LoadGrammar(pizzaGrammar);
        }

        void PizzaGrammar_SpeechRecognized(
            object sender, RecognitionEventArgs e)
        {
            MessageBox.Show(e.Result.Text);
        }
    }
}
SAPI 5.3 & Windows Vista™ References
Speech API Overview
http://msdn2.microsoft.com/en-us/library/ms720151.aspx#API_Speech_Recognition
Microsoft Speech API (SAPI) 5.3
http://msdn2.microsoft.com/en-us/library/ms723627.aspx
“Exploring New Speech Recognition And Synthesis APIs In Windows Vista” by Robert Brown
http://msdn.microsoft.com/msdnmag/issues/06/01/speechinWindowsVista/default.aspx#Resources
Interaction Manager Approaches
Three approaches, each with its own interaction manager and modality components:

– X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules
– W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML
– Object-oriented: Interaction Manager (C#) over SAPI 5.3
Step 1: Start with Standard VoiceXML and Standard XHTML

VoiceXML

<form id="topform"> <field name="city"> <prompt>Say a name</prompt> <grammar src="city.grxml"/> </field> </form>
XHTML
<form> Result: <input type="text" name="in1"/> </form>
W3C grammar language
Step 2: Combine

<html xmlns="http://www.w3.org/1999/xhtml">
<head> <form id="topform"> <field name="city"> <prompt>Say a name</prompt> <grammar src="city.grxml"/> </field></form></head>
<body> <form> Result: <input type="text" name="in1"/> </form></body>
</html>
Step 3: Insert vxml Namespace
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml">
<head> <vxml:form id="topform"> <vxml:field name="city"> <vxml:prompt>Say a name</vxml:prompt> <vxml:grammar src="city.grxml"/> </vxml:field> </vxml:form></head>
<body> <form> Result: <input type="text" name="in1"/> </form></body>
</html>
Step 4: Insert event
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:vxml="http://www.w3.org/2001/vxml" xmlns:ev="http://www.w3.org/2001/xml-events">
<head> <vxml:form id="topform"> <vxml:field name="city"> <vxml:prompt>Say a name</vxml:prompt> <vxml:grammar src="city.grxml"/> </vxml:field> </vxml:form></head>
<body> <form ev:event="load" ev:handler="#topform"> Result: <input type="text" name="in1"/> </form></body>
</html>
Step 5: Insert <sync>

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:vxml="http://www.w3.org/2001/vxml" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:xv="http://www.w3.org/2002/xhtml+voice">
<head> <xv:sync xv:input="in1" xv:field="#result"/> <vxml:form id="topform"> <vxml:field name="city" xv:id="result"> <vxml:prompt>Say a name</vxml:prompt> <vxml:grammar src="city.grxml"/> </vxml:field> </vxml:form></head>
<body> <form ev:event="load" ev:handler="#topform"> Result: <input type="text" name="in1"/> </form></body>
</html>
XHTML plus Voice (X+V) References

• Available on
  – ACCESS Systems’ NetFront Multimodal Browser for PocketPC 2003
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera Software Multimodal Browser for Sharp Zaurus
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera 9 for Windows
    http://www.opera.com/
• Programmers Guide
  – ftp://ftp.software.ibm.com/software/pervasive/info/multimodal/XHTML_voice_programmers_guide.pdf
• For a variety of small illustrative applications
  – http://www.larson-tech.com/MM-Projects/Demos.htm
Exercise 11
Specify the X+V notation for integrating the following VoiceXML and XHTML code by completing the code on the next page
VoiceXML
<form id="stateForm"> <field name="state"> <prompt>Say a state name</prompt> <grammar src="city.grxml"/> </field> </form>
XHTML
<form> Result: <input type="text" name="in1"/> </form>
Exercise 11 (continued)
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:vxml="http://www.w3.org/2001/vxml" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:xv="http://www.w3.org/2002/xhtml+voice">
<head> <xv:sync xv:input="_______" xv:field="________"/> <vxml:form id="________"> <vxml:field name="state" xv:id="________"> <vxml:prompt>Say a state name</vxml:prompt> <vxml:grammar src="state.grxml"/> </vxml:field> </vxml:form></head>
<body> <form ev:event="load" ev:handler="#________"> Result: <input type="text" name="_______"/> </form></body>
</html>
Interaction Manager Approaches
Three approaches, each with its own interaction manager and modality components:

– X+V: Interaction Manager (XHTML) with VoiceXML 2.0 modules
– W3C: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML
– Object-oriented: Interaction Manager (C#) over SAPI 5.3
MMI Architecture—4 Basic Components
• Runtime Framework or Browser— initializes application and interprets the markup
• Interaction Manager—coordinates modality components and provides application flow
• Modality Components—provide modality capabilities such as speech, pen, keyboard, mouse
• Data Model—handles shared data
[Diagram: Interaction Manager (SCXML) coordinating XHTML, VoiceXML 3.0, and InkML modality components over a shared Data Model]
Multimodal Architecture and Interfaces
• A loosely-coupled, event-based architecture for integrating multiple modalities into applications
• All communication is event-based
• Based on a set of standard life-cycle events
• Components can also expose other events as required
• Encapsulation protects component data
• Encapsulation enhances extensibility to new modalities
• Can be used outside a Web environment
Specify Interaction Manager Using Harel State Charts
Extension of state transition systems
• States
• Transitions
• Nested state-transition systems
• Parallel state-transition systems
• History
[State chart: PrepareState – StartState – WaitState – EndState, advanced by PrepareResponse(success), StartResponse, and DoneSuccess; PrepareResponse(fail), StartFail, and DoneFail all lead to FailState]
Example State Transition System
State Chart XML (SCXML)
…
<state id="PrepareState">
<send event="prepare" contentURL="hello.vxml"/>
<transition event="prepareResponse" cond="status='success'" target="StartState"/>
<transition event="prepareResponse" cond="status='failure'" target="FailState"/>
</state>
…
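The state chart that this SCXML fragment encodes can also be modeled as a plain transition table, which is a handy way to test the dialog logic. A minimal Python sketch (state and event names come from the chart; the table itself is illustrative, not part of SCXML):

```python
# (current state, event) -> next state, following the chart on this slide.
TRANSITIONS = {
    ("PrepareState", "prepareResponse(success)"): "StartState",
    ("PrepareState", "prepareResponse(fail)"): "FailState",
    ("StartState", "startResponse"): "WaitState",
    ("StartState", "startFail"): "FailState",
    ("WaitState", "doneSuccess"): "EndState",
    ("WaitState", "doneFail"): "FailState",
}

def step(state, event):
    """Advance the chart one event; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "PrepareState"
for event in ["prepareResponse(success)", "startResponse", "doneSuccess"]:
    state = step(state, event)
print(state)  # EndState
```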
Example State Chart with Parallel States
[State chart: two parallel machines, one per modality – PrepareVoice/StartVoice/WaitVoice/EndVoice/FailVoice and PrepareGUI/StartGUI/WaitGUI/EndGUI/FailGUI – each with the same PrepareResponse(success/fail), StartResponse, DoneSuccess, StartFail, and DoneFail transitions as the single chart]
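In SCXML the two machines above would run under a <parallel> element. A hedged sketch extrapolated from the single-machine fragment on the previous slide (SCXML was still a working draft at this time; element names follow that draft, state ids follow the chart):

```xml
<parallel id="run">
  <state id="voice" initial="PrepareVoice">
    <state id="PrepareVoice">
      <transition event="prepareResponse" cond="status='success'" target="StartVoice"/>
      <transition event="prepareResponse" cond="status='failure'" target="FailVoice"/>
    </state>
    <!-- StartVoice, WaitVoice, EndVoice, FailVoice as in the chart -->
  </state>
  <state id="gui" initial="PrepareGUI">
    <state id="PrepareGUI">
      <transition event="prepareResponse" cond="status='success'" target="StartGUI"/>
      <transition event="prepareResponse" cond="status='failure'" target="FailGUI"/>
    </state>
    <!-- StartGUI, WaitGUI, EndGUI, FailGUI as in the chart -->
  </state>
</parallel>
```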
The Life Cycle Events

For each life-cycle event, the Interaction Manager sends the event to each modality component (GUI, VUI), and each component replies with the matching response event:

• prepare → prepareResponse
• start → startResponse
• cancel → cancelResponse
• pause → pauseResponse
• resume → resumeResponse
More Life Cycle Events

• newContextRequest → newContextResponse
• data – exchanged in both directions between the Interaction Manager and each modality
• done – sent by a modality to the Interaction Manager
• clearContext – sent by the Interaction Manager to each modality
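The fan-out pattern in these diagrams can be sketched as an interaction manager that broadcasts each life-cycle event to its modality components and collects one response apiece. A plain-Python illustration (the classes are invented for this sketch, not part of the MMI specification):

```python
class ModalityComponent:
    """Stand-in for a modality component (GUI, VUI, ...)."""
    def __init__(self, name):
        self.name = name

    def handle(self, event):
        # A real component would prepare, start, cancel, etc. its own UI here.
        return f"{event}Response from {self.name}"

class InteractionManager:
    def __init__(self, components):
        self.components = components

    def broadcast(self, event):
        """Send one life-cycle event to every component; gather the responses."""
        return [c.handle(event) for c in self.components]

im = InteractionManager([ModalityComponent("GUI"), ModalityComponent("VUI")])
print(im.broadcast("prepare"))
# ['prepareResponse from GUI', 'prepareResponse from VUI']
```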
Synchronization Using the Lifecycle Data Event

• Intent-based events
  – Capture the underlying intent rather than the physical manifestation of user-interaction events
  – Independent of the physical characteristics of particular devices
• Data/reset – reset one or more field values to null
• Data/focus – focus on another field
• Data/change – field value has changed
Lifecycle Events between Interaction Manager and Modality

[Diagram: the modality’s state chart (PrepareState, StartState, WaitState, EndState, FailState) aligned with the events exchanged with the Interaction Manager – prepare / prepare response (success or failure), start / start response (success or failure), data, and done]
MMI Architecture Principles
• Runtime Framework communicates with Modality Components through asynchronous events
• Modality Components don’t communicate directly with each other, but indirectly through the Runtime Framework
• Components must implement basic life cycle events, may expose other events
• Modality components can be nested (e.g. a Voice Dialog component like a VoiceXML <form>)
• Components need not be markup-based
• EMMA communicates users’ inputs to the Interaction Manager
Modalities

• GUI Modality (XHTML)
  – An adapter converts lifecycle events to XHTML events and XHTML events to lifecycle events
• Voice Modality (VoiceXML 3.0)
  – Lifecycle events are embedded into VoiceXML 3.0
Exercise 12
What should VoiceXML do when it receives each of the following events?
A. Reset
B. Change
C. Focus
Modalities

VoiceXML 3.0 will support lifecycle events.
<form> <catch name="change"> <assign name="city" value="data"/> </catch>
…
<field name = "city"> <prompt> Blah </prompt> <grammar src="city.grxml"/> <filled> <send event="data.change" data="city"/> </filled> </field>
</form>
Exercise 13
What should HTML do when it receives each of the following events?
A. Reset
B. Change
C. Focus
Modalities

XHTML is extended to support lifecycle events sent to a modality.

<head> …
  <ev:listener ev:event="onChange" ev:observer="app1" ev:handler="onChangeHandler()"/>
  …
  <script> function onChangeHandler() { post("data", "city"); } </script>
</head>
…
<body id="app1"> <input type="text" id="city" value=""/> </body>
…
Modalities

XHTML is extended to support lifecycle events sent to the interaction manager.

<head> …
  <handler type="text/javascript" ev:event="data">
    if (event == "change") { document.app1.city.value = data.city; }
  </handler>
… </head>
…
<body id="app1"> <input type="text" id="city" value=""/> </body>
…
References

• SCXML
  – Second working draft available at http://www.w3.org/TR/2006/WD-scxml-20060124/
  – Open source available from http://jakarta.apache.org/commons/sandbox/scxml/
• Multimodal Architecture and Interfaces
  – Working draft available at http://www.w3.org/TR/2006/WD-mmi-arch-20060414/
• Voice Modality
  – First working draft of VoiceXML 3.0 scheduled for November 2007
• XHTML
  – Full recommendation
  – Adapters must be hand-coded
• Other modalities
  – TBD
Comparison
Standard languages:
  – Object-oriented: SRGS, SISR, SSML
  – X+V: VoiceXML, SRGS, SSML, SISR, XHTML
  – W3C: SCXML, SRGS, VoiceXML, SSML, SISR, XHTML, EMMA, CCXML

Interaction manager:
  – Object-oriented: C#
  – X+V: XHTML
  – W3C: SCXML

Modes:
  – Object-oriented: GUI, speech
  – X+V: GUI, speech
  – W3C: GUI, speech, ink, …
Availability

SAPI 5.3
  – Microsoft Windows Vista®
X+V
  – ACCESS Systems’ NetFront Multimodal Browser for PocketPC 2003
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera Software Multimodal Browser for Sharp Zaurus
    http://www-306.ibm.com/software/pervasive/multimodal/?Open&ca=daw-prod-mmb
  – Opera 9 for Windows
    http://www.opera.com/
W3C
  – First working draft of VoiceXML 3.0 not yet available
  – Working drafts of SCXML are available; some open-source implementations are available
Proprietary APIs
  – Available from vendor
Discussion Question
Should a developer insert SALT tags or X+V modules into an existing Web page without redesigning the Web page?
Conclusion
• Multimodal applications offer benefits over today’s traditional GUIs.
• Only use multimodal if there is a clear benefit.
• Standard languages are available today to develop multimodal applications.
• Don’t reinvent the wheel.
• Creativity and lots of usability testing are necessary to create world-class multimodal applications.
Web Resources
http://www.w3.org/voice
– Specification of grammar, semantic interpretation, and speech synthesis languages
http://www.w3.org/2002/mmi
– Specification of EMMA and InkML languages
http://www.microsoft.com (and query SALT)
– SALT specification and download instructions for adding SALT to Internet Explorer
http://www-306.ibm.com/software/pervasive/multimodal/
– X+V specification; download Opera and ACCESS browsers
http://www.larson-tech.com/SALT/ReadMeFirst.html
– Student projects using SALT to develop multimodal applications
http://www.larson-tech.com/MMGuide.html or http://www.w3.org/2002/mmi/Group/2006/Guidelines/
– User interface guidelines for multimodal applications
Status of W3C Multimodal Interface Languages

[Chart: each language positioned along the W3C track – Requirements, Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, Recommendation – for VoiceXML 2.0, VoiceXML 2.1, Speech Recognition Grammar Specification (SRGS) 1.0, Speech Synthesis Markup Language (SSML) 1.0, Extensible MultiModal Annotation (EMMA) 1.0, Semantic Interpretation of Speech Recognition (SISR) 1.0, State Chart XML (SCXML) 1.0, and InkML 1.0]
Questions
?
Answer to Exercise 5
Rankings (1 = best) for each content-manipulation task by mode:

Select objects:
  (1) Pen: point to or circle the object
  (2) Mouse/joystick: point to and click on the object, or drag to select text
  (3) Voice: speak the name of the object
  (4) Keyboard/keypad: press keys to position the cursor on the object and press the select key

Enter text:
  (1) Keyboard/keypad: press keys to spell the words in the text
  (2) Voice: speak the words in the text
  (3) Pen: write the text
  (4) Mouse/joystick: spell the text by selecting letters from a soft keyboard

Enter symbols:
  (1) Pen: draw the symbol where it should be placed
  (2) Mouse/joystick: select the symbol from a menu and indicate where it should be placed
  (3) Voice: say the name of the symbol and where it should be placed
  (4) Keyboard/keypad: enter one or more characters that together represent the symbol

Enter sketches or illustrations:
  (1) Pen: draw the sketch or illustration
  (2) Voice: verbally describe the sketch or illustration
  (3) Mouse/joystick: create the sketch by moving the mouse so it leaves a trail (similar to an Etch-a-Sketch™)
  (4) Keyboard/keypad: impossible
Answer to Exercise 7: Write a grammar for zero to nineteen

<grammar type="application/srgs+xml" root="zero_to_19" mode="voice">
  <rule id="zero_to_19">
    <one-of>
      <ruleref uri="#single_digit"/>
      <ruleref uri="#teens"/>
    </one-of>
  </rule>
  <rule id="single_digit">
    <one-of> <item> zero </item> <item> one </item> <item> two </item> <item> three </item> <item> four </item> <item> five </item> <item> six </item> <item> seven </item> <item> eight </item> <item> nine </item> </one-of>
  </rule>
  <rule id="teens">
    <one-of> <item> ten </item> <item> eleven </item> <item> twelve </item> <item> thirteen </item> <item> fourteen </item> <item> fifteen </item> <item> sixteen </item> <item> seventeen </item> <item> eighteen </item> <item> nineteen </item> </one-of>
  </rule>
</grammar>
Answer to Exercise 8
<grammar type = "application/srgs+xml" root = "yes" mode = "voice">
<rule id = "yes"> <one-of> <item> yes </item> <item> sure </item> <item> affirmative </item>
…
</one-of> </rule>
</grammar>
Answer to Exercise 9
<grammar type = "application/srgs+xml" root = "yes" mode = "voice">
<rule id = "yes"> <one-of> <item> yes </item> <item> sure <tag> out = "yes" </tag> </item> <item> affirmative <tag> out = "yes" </tag> </item>
…
</one-of> </rule>
</grammar>
Answer to Exercise 10
Given the following two EMMA specifications, what is the unified EMMA specification?

<interpretation mode="speech"> <moneyTransfer> <sourceAcct hook="ink"/> <targetAcct hook="ink"/> <amount> 300 </amount> </moneyTransfer> </interpretation>

<interpretation mode="ink"> <moneyTransfer> <sourceAcct> savings </sourceAcct> <targetAcct> checking </targetAcct> </moneyTransfer> </interpretation>

Unified:

<interpretation id="intp1"> <moneyTransfer> <sourceAcct> savings </sourceAcct> <targetAcct> checking </targetAcct> <amount> 300 </amount> </moneyTransfer> </interpretation>
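The unification above can be pictured as slot filling: slots marked with a hook in one interpretation are filled from the other. A plain-Python illustration (the dictionary encoding is invented for this sketch; EMMA defines unification over the XML itself):

```python
HOOK = object()  # marks a slot waiting to be filled by another modality

def unify(a, b):
    """Merge two partial interpretations; hooked slots in one are filled from the other."""
    merged = dict(a)
    for slot, value in b.items():
        if merged.get(slot, HOOK) is HOOK:
            merged[slot] = value
    return merged

speech = {"sourceAcct": HOOK, "targetAcct": HOOK, "amount": "300"}
ink = {"sourceAcct": "savings", "targetAcct": "checking"}
print(unify(speech, ink))
# {'sourceAcct': 'savings', 'targetAcct': 'checking', 'amount': '300'}
```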
Answer to Exercise 11
<html xmlns= "http://www.w3.org/1999/xhtml" xmlns:vxml= "http://www.w3.org/2001/vxml" xmlns:ev= "http://www.w3.org/2001/xml-events" xmlns:xv="http://www.w3.org/2002/xhtml+voice">
<head> <xv:sync xv:input="in4" xv:field="#answer"/> <vxml:form id= "stateForm"> <vxml:field name= "state" xv:id= "answer"> <vxml:prompt>Say a state name</vxml:prompt> <vxml:grammar src = "state.grxml"/> </vxml:field> </vxml:form></head>
<body> <form ev:event="load" ev:handler="#stateForm"> Result: <input type="text" name="in4"/> </form></body>
</html>
Answer to Exercise 12

What should VoiceXML do when it receives each of the following events?

• Reset – reset the value
• Change – change the value
• Focus – prompt for the value now in focus
Answer to Exercise 13

What should HTML do when it receives each of the following events?

• Reset – reset the value; the author decides whether the cursor should be moved to the reset field
• Change – change the value; the author decides whether the cursor should be moved to the changed field
• Focus – move the cursor to the item in focus