Figure captions:
Figure 1. Typical information flow in a multimodal architecture
Figure 2. Facilitated multimodal architecture
Figure 3. The three-tiered bottom-up MTC architecture, comprising multiple members, multiple teams, and a decision-making committee
Figure 4. Functionality, architectural features, and general classification of different multimodal speech and gesture systems
Figure 5. QuickSet system running on a wireless handheld PC, showing user-created entities on the map interface
Figure 6. The OGI QuickSet system's facilitated multi-agent architecture
Figure 7. Architectural flow of signal and language processing in IBM's HCWP system
Figure 8. Boeing 777 aircraft's main equipment center
Figure 9. Boeing's natural language understanding technology
Figure 10. Boeing's integrated speech and gesture system architecture
Figure 11. The NCR Field Medic Associate's (FMA) flexible hardware components (top); FMA inserted into a medic's vest while speaking during field use (bottom)
Figure 12. The NCR Field Medic Coordinator's (FMC) visual interface
Figure 13. The BBN Portable Voice Assistant (PVA) interface illustrating a diagram from the repair manual, with the system's display and spoken-language processing status confirmed at the top
Figure 14. Data flow through the PVA applet's three processing threads
Figure 15. BBN's networked architecture for the PVA, with speech recognition processing split so that only feature extraction takes place on the handheld
Figure 1
Figure 2
[Figure 3 diagram: a decision-making Committee sits above Teams 1-4, which sit above individual member recognizers. Multiple input features — acceleration, stroke length, pressure, speech rate, signal energy, image centroid, line crossings, and speech pitch accent — feed the member recognizers, and the committee outputs the final decision.]
Figure 3
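The Members-Teams-Committee structure of Figure 3 can be caricatured as weighted score combination at two levels: member recognizers emit scores over candidate labels, teams average their members, and the committee averages the teams before deciding. A minimal sketch, with all labels, weights, and scores invented for illustration:

```python
# Hypothetical sketch of the Members-Teams-Committee (MTC) decision scheme
# in Figure 3. Member recognizers emit posterior-like scores over labels,
# teams combine their members with weights, and a committee combines the
# teams. Every number and label below is made up.

def combine(scores_list, weights):
    """Weighted average of several {label: score} dicts."""
    labels = set().union(*scores_list)
    return {
        lab: sum(w * s.get(lab, 0.0) for w, s in zip(weights, scores_list))
        for lab in labels
    }

# Two member recognizers per team, scoring two candidate interpretations.
team1_members = [{"circle": 0.7, "arrow": 0.3}, {"circle": 0.6, "arrow": 0.4}]
team2_members = [{"circle": 0.2, "arrow": 0.8}, {"circle": 0.4, "arrow": 0.6}]

team1 = combine(team1_members, [0.5, 0.5])
team2 = combine(team2_members, [0.5, 0.5])

# The committee weights the teams (e.g., by held-out accuracy) and decides.
committee = combine([team1, team2], [0.6, 0.4])
decision = max(committee, key=committee.get)
print(decision)  # -> circle
```

In a real system the team and committee weights would be trained rather than fixed, which is what lets the hybrid symbolic/statistical framework adapt to which recognizers are reliable for which input features.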
Multimodal system characteristics | QuickSet | Human-Centric Word Processor | VR Aircraft Maintenance Training | Field Medic Information | Portable Voice Assistant
Recognition of simultaneous or alternative individual modes | Simultaneous & individual modes | Simultaneous & individual modes | Simultaneous & individual modes | Alternative individual modes (1) | Simultaneous & individual modes
Type & size of gesture vocabulary (3) | Pen input, multiple gestures, large vocabulary | Pen input, deictic selection | 3D manual input, multiple gestures, small vocabulary | Pen input, deictic selection | Pen input, deictic selection (2)
Size of speech vocabulary & type of linguistic processing (3) | Moderate vocabulary, grammar-based | Large vocabulary, statistical language processing | Small vocabulary, grammar-based | Moderate vocabulary, grammar-based | Small vocabulary, grammar-based
Type of signal fusion | Late semantic fusion, unification, hybrid symbolic/statistical MTC framework | Late semantic fusion, frame-based | Late semantic fusion, frame-based | No mode fusion | Late semantic fusion, frame-based
Type of platform & applications | Wireless handheld, varied map & VR applications (4) | Desktop computer, word processing | Virtual reality system, aircraft maintenance training | Wireless handheld, medical field emergencies | Wireless handheld, catalogue ordering
Evaluation status | Proactive user-centered design & iterative system evaluations | Proactive user-centered design | Planned for future | Proactive user-centered design & iterative system evaluations | Planned for future
Figure 4
(1) The FMA component recognizes speech only, and the FMC component recognizes gestural selections or speech. The FMC also can transmit digital speech and ink data, and can read data from smart cards and physiological monitors.
(2) The PVA also performs handwriting recognition.
(3) A small speech vocabulary is up to 200 words, moderate 300-1,000 words, and large in excess of 1,000 words. For pen-based gestures, deictic selection is an individual gesture, a small vocabulary is 2-20 gestures, moderate 20-100, and large in excess of 100 gestures.
(4) QuickSet's map applications include military simulation, medical informatics, real estate selection, and so forth.
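The "late semantic fusion, frame-based" entries in Figure 4 mean that speech and gesture are recognized separately, each producing a partial semantic frame, and fusion merges compatible frames into one command. A toy sketch of that merge step, with invented slot names and values (not any listed system's actual code):

```python
# Illustrative frame unification for late semantic fusion: two partial
# frames merge if their filled slots do not conflict. The command "add a
# medevac unit" plus a pen tap on a map is the classic QuickSet-style
# example; the slot names and coordinates here are invented.

def unify(frame_a, frame_b):
    """Merge two partial frames; return None if any slot conflicts."""
    merged = dict(frame_a)
    for slot, value in frame_b.items():
        if slot in merged and merged[slot] != value:
            return None  # conflicting interpretations do not unify
        merged[slot] = value
    return merged

speech_frame  = {"action": "create", "object": "medevac_unit"}  # from speech
gesture_frame = {"location": (44.21, -123.09)}                  # from pen tap

command = unify(speech_frame, gesture_frame)
print(command)
```

Unification-based fusion (the QuickSet column) generalizes this by also checking type constraints and timing before merging, which is why conflicting or ill-timed inputs can be rejected rather than silently combined.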
Figure 5
[Figure 6 diagram: the QuickSet interface agents (speech/TTS, gesture, natural language, multimodal integration) communicate through a central Facilitator (routing, triggering, dispatching) over the Interagent Communication Language (ICL, expressed as Horn clauses). Also attached to the Facilitator are the ModSAF simulator and simulation agent, an application bridge, databases, other facilitators, a CORBA bridge, COM objects, Java-enabled Web pages, other user interfaces (e.g., Wizard), ALP, ExInit, and VR interfaces (CommandVu, Dragon2, VGIS).]
Figure 6
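In the facilitated architecture of Figure 6, agents never call each other directly; they register interests with the facilitator, which routes and dispatches messages. A minimal sketch of that pattern, with invented topic names and handlers:

```python
# Minimal facilitator-style message routing, loosely in the spirit of
# Figure 6's "routing, triggering, dispatching" hub. Topic names and
# handler behavior are invented for illustration.

class Facilitator:
    def __init__(self):
        self.handlers = {}  # topic -> list of subscribed agent callbacks

    def register(self, topic, handler):
        """An agent declares it can handle messages on this topic."""
        self.handlers.setdefault(topic, []).append(handler)

    def dispatch(self, topic, message):
        """Route a message to every agent registered for the topic."""
        return [handle(message) for handle in self.handlers.get(topic, [])]

fac = Facilitator()
# Two agents subscribe to speech results: an integrator and a logger.
fac.register("speech.result", lambda m: f"integrate:{m}")
fac.register("speech.result", lambda m: f"log:{m}")

print(fac.dispatch("speech.result", "pan left"))
```

The payoff of this indirection is the one the figure implies: simulators, databases, bridges, and user interfaces can be added or swapped by registering with the hub, without any other agent changing.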
Figure 7
Figure 8
Figure 9
[Figure 9 diagram (Natural Language Understanding Technology): sentence-level analysis runs a Tokenizer, Syntactic Parser, and Semantic Interpreter, supported by a syntactic lexicon and patterns, a syntactic grammar, semantic lexicons, and a semantic type hierarchy. Discourse-level analysis comprises a Structure Initializer, Entity Interpreter, Action and Event Interpreter, Inter-sentential Relations Analyzer, and Reference Resolver, drawing on domain-relevant relations.]
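The sentence-level portion of Figure 9 is a staged pipeline: tokenize, parse, then interpret semantically. A toy sketch of that chaining, where each stage and its output format is invented and far simpler than Boeing's actual grammar-driven components:

```python
# Caricature of the Tokenizer -> Syntactic Parser -> Semantic Interpreter
# chain in Figure 9. The "grammar" (first token is the verb, the rest the
# object phrase) is a stand-in invented purely to show the staging.

def tokenizer(text):
    return text.lower().rstrip(".").split()

def syntactic_parser(tokens):
    return {"verb": tokens[0], "object": " ".join(tokens[1:])}

def semantic_interpreter(parse):
    return {"action": parse["verb"], "theme": parse["object"]}

stages = [tokenizer, syntactic_parser, semantic_interpreter]
result = "Open the access panel."
for stage in stages:
    result = stage(result)  # each stage consumes the previous stage's output
print(result)
```

The discourse-level components in the figure would then take a sequence of such sentence interpretations and resolve references and inter-sentential relations across them.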
Figure 10
[Figure 10 diagram (Speech/Gesture Integration Architecture): a speech recognizer produces word strings for the NLU system, which emits COMMAND frames. A DataGlove feeds the gesture recognizer and gesture interpretation, which emit GESTURE frames along with Division object IDs and/or locations (via the Division system's Gesture and Viz Actors). The integration system merges these into unified command/gesture templates, and the command generation system issues navigation, manipulation, and selection commands.]
Figure 11
Figure 12
Figure 13
[Figure 14 diagram: pen events (from the stylus or mouse) and speech events (from the voice recognition server) flow into an integration-and-response component, which produces application output as visual and audio feedback.]
Figure 14
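Figure 14's three threads — a pen source, a speech source, and an integration/response consumer — are a classic producer/consumer arrangement. A hedged sketch using a shared queue, with invented event names:

```python
# Sketch of Figure 14's three-thread data flow: pen and speech threads
# produce events; the integration thread consumes them from one queue and
# would drive visual/audio feedback. Event payloads are invented.
import queue
import threading

events = queue.Queue()

def pen_source():
    events.put(("pen", "tap item 3"))          # from stylus or mouse

def speech_source():
    events.put(("speech", "order two of these"))  # from recognition server

def integrate(n):
    """Consume n events; Queue.get blocks until a producer delivers."""
    return [events.get() for _ in range(n)]

for src in (pen_source, speech_source):
    threading.Thread(target=src).start()

handled = integrate(2)
print(sorted(kind for kind, _ in handled))
```

Keeping the modalities on separate threads is what lets the applet accept pen and speech input simultaneously while the integration thread serializes them into responses.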
[Figure 15 diagram: on the handheld client, a Web browser running a Java applet (HTML, images/audio, Java classes, pen input) and a speech front-end (microphone, Speech API) connect over the wireless network to the server side. The speech front-end sends speech parameters and control messages to the recognition server (which holds the grammar and vocabulary), while the browser talks to the Web server.]
Figure 15
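The point of Figure 15's split is that only lightweight feature extraction runs on the handheld, so just compact speech parameters cross the wireless link while the expensive recognition search stays on the server. A toy sketch of that division of labor, with invented signal values and a stand-in "recognizer":

```python
# Sketch of the client/server split in Figure 15. The handheld reduces raw
# audio to small per-frame features; the server does the heavy recognition.
# The audio samples, frame size, and threshold are all invented.

def extract_features(samples, frame_size=4):
    """Handheld side: per-frame energy, a tiny payload for the network."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples), frame_size)
    ]

def recognize(features):
    """Server side: stand-in for the real recognizer's search."""
    return "speech" if max(features) > 1.0 else "silence"

raw_audio = [0.1, 0.9, -0.8, 0.2, 0.0, 0.1, -0.1, 0.0]  # imaginary samples
features = extract_features(raw_audio)  # only this would be transmitted
print(recognize(features))
```

Shipping features instead of raw audio cuts bandwidth and lets the server-side recognizer use a grammar and vocabulary far larger than the handheld could hold.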