Deliverable 2.1
DICIT Architecture
Tools, Standards, Hardware and Software for the
First Prototypes
Authors: Gregg Daggett
Affiliations: IBM
Date: 5-Oct-2007
Document Type: R
Status/Version: 1.0
Dissemination Level: PU
FP6 IST-034624 http://dicit.itc.it
D2.1 – DICIT Architecture
DICIT_D2.1_20071005 ii
Project Reference: FP6 IST-034624
Project Acronym: DICIT
Project Full Title: Distant-talking Interfaces for Control of Interactive TV
Dissemination Level: PU
Contractual Date of Delivery: 31-Aug-2007
Actual Date of Delivery: 5-Oct-2007
Document Number: DICIT_D2.1_20070930
Type: Deliverable
Status & Version: 1.0
Number of Pages: 4+31
WP Contributing to the Deliverable: WP2 (WP responsible: Gregg Daggett, IBM)
WP Task responsible: Gregg Daggett (IBM)
Authors (Affiliation): Gregg Daggett (IBM)
Other Contributors: Nicole Beringer and Matthias Bezold (Elektrobit); Andrea Buson and Thomas
Antonello (Fracarro); Alessio Brutti, Luca Cristoforetti, Maurizio Omologo and Christian Zieger
(FBK-irst); Roberto Manione and Fiorenza Arisio (Amuser); Lutz Marquardt and Edwin Mabande (FAU)
Reviewer:
EC Project Officer: Erwin Valentini
Keywords: architecture, multi-microphone devices, distant-talking speech recognition devices,
voice-operated devices, Interactive TV, anti-intrusion, surveillance.
Abstract:
The purpose of this document is to describe the architecture for the first DICIT prototype
system. This includes the hardware and software components and interfaces which will
comprise the prototype, as well as the software tools and standards which will be used during
the prototype development.
Contents
Index of Figures
1. Introduction
2. DICIT First STB Prototype System Architecture
2.1. Architecture Overview
2.2. Multichannel Acoustic Processing Subsystem
2.2.1. Internal Software Architecture
2.2.2. Preprocessing (PreProc)
2.2.3. Beamforming (BF)
2.2.4. Two-Channel Acoustic Echo Cancellation (2C-AEC)
2.2.5. Speaker Localization (SLoc)
2.2.6. Smart Speech Filtering (SSF)
2.3. Speech & Dialog Management Subsystem
2.3.1. Interface from Multichannel Acoustic Processing Subsystem
2.3.2. ASR Engine
2.3.3. ASR Acoustic Models
2.3.4. ASR Language Models
2.3.5. NLU Engine
2.3.6. NLU Models
2.3.7. Dialog Manager
2.3.8. CIMA Internal Architecture
2.3.9. TTS Engine
2.4. Content Management Subsystem
2.4.1. Electronic Program Guide
2.4.2. User Profiles
2.4.3. Design considerations
2.5. TV Control Subsystem
2.5.1. STB
2.5.2. Remote Control
2.5.3. STB Interface API
3. Tools and Standards for the STB prototype
3.1. Software Standards
3.1.1. SCXML
3.2. EB GUIDE Studio
3.2.1. SCXML Export
3.2.2. WOZ Plugins
3.3. IBM CIMA Development Tool
4. DICIT First Surveillance Prototype System Architecture
4.1. Central Unit
4.2. Acoustic Event Detection and Classification
5. Appendix A: Prototype Hardware Requirements
5.1. PC 1
5.2. PC 2
5.3. External hardware
6. Appendix B: STB Interface API Functions
7. References
Index of Figures
Figure 1: Device setup for the first DICIT prototype
Figure 2: Subsystems of the first DICIT prototype
Figure 3: Multichannel Acoustic Processing Subsystem
Figure 4: Block structure of Multichannel Acoustic Processing
Figure 5: Speech and Dialog Management subsystem
Figure 6: CIMA Internal Architecture
Figure 7: DTD Fragment for XMLTV
Figure 8: E/R model of EPG data
Figure 9: E/R model of User Profile
Figure 10: TV Control Architecture
Figure 11: Remote Control for the first DICIT prototype
Figure 12: Sample SCXML State Diagram
Figure 13: Sample SCXML code listing
Figure 14: IBM CIMA Development Tool
Figure 15: Device setup for the first DICIT surveillance prototype
Figure 16: Block diagram for the AED
Figure 17: PC 1 audio acquisition chain
Figure 18: The nested microphone array for audio acquisition
1. Introduction
The DICIT project requires the design and implementation of functional prototypes as part of the
deliverables for WP2, as outlined in the DICIT Technical Annex [1]. These prototypes will be
based upon new technologies developed in the other technical work packages of the project (WP3,
WP4, WP5). The first prototypes for DICIT (to be delivered midway through the project) will
implement a first version of a distant-talking TV interface incorporating a set-top-box (STB)
platform as well as a surveillance system for anti-intrusion. As the DICIT technologies are further
refined during the course of the project, a second STB-based TV interface prototype is expected
during the second half of the project.
This document describes the software and hardware architecture of the first DICIT prototypes for
the TV and surveillance scenarios, as jointly defined by the DICIT partners. The purpose of this
architecture definition is to facilitate the development, integration and debugging of the software
and hardware components which comprise the first DICIT prototypes.
Following this introductory section, Section 2 of this document contains the heart of the architecture
specification for the TV scenario. After an outline of the overall architecture and hardware layout in
subsection 2.1, the remainder of Section 2 provides details of each of the four subsystems of the
DICIT STB-based prototype.
Section 3 provides information regarding the software tools which will be used during the course of
development of the first DICIT STB-based prototype, as well as a description of the software
standards which will be employed.
Section 4 deals with the architecture of the first surveillance prototype, which will integrate
acoustic event detection functionality into an already existing system.
Appendix A provides a complete listing of the hardware and devices which will be required for the
first DICIT prototypes, and Appendix B lists the functions which will be supported for the PC
interface controlling the STB device itself.
While this document is intended to provide a complete overall design of the hardware and software
aspects of the DICIT architecture, the design is neither exhaustive nor final. Since the prototyping
of many DICIT technology components is in progress, a more finely detailed description of the
internal aspects of these components will be contained in the technical deliverables for each of
WP3, WP4 and WP5.
It is expected that this document will remain a living document which will be extended during the
course of the project as the DICIT architecture matures. It will also be revised as part of the
architecture deliverable for the second STB-based and surveillance prototype later in the project.
2. DICIT First STB Prototype System Architecture
2.1. Architecture Overview
The architecture of the first DICIT STB prototype is divided into several subsections which
correspond to logical functions of the system. These subsections also correlate with the various
work packages (WPs) which comprise the technical work for the project.
The expected physical device setup for the first DICIT prototype is shown in the illustration below.
The main elements of the setup include:
• a microphone array and audio hardware front-end for acquiring spoken user input to the
system
• a first PC (PC 1) for handling multichannel acoustic processing of audio input
• a second PC (PC 2) for handling speech and dialog management processing, together with
management of TV EPG information and user profiles
• a set-top-box (STB) and display for TV program viewing
• a remote control for user manual input
• stereo speakers for TV audio and dialog voice output
Figure 1: Device setup for the first DICIT prototype (remote control, TV/display, set-top-box,
PC 1, PC 2, nested microphone array, left and right speakers, audio hardware front-end)
From a logical standpoint, the first DICIT prototype architecture is divided into the following four
subsystems:
• Multichannel Acoustic Processing Subsystem
• Speech & Dialog Management Subsystem
• Content Management Subsystem
• TV Control Subsystem
Figure 2: Subsystems of the first DICIT prototype
Each of the four subsystems is described in more detail in the following Sections 2.2 through 2.5.
2.2. Multichannel Acoustic Processing Subsystem
In the DICIT STB prototype, the Multichannel Acoustic Processing subsystem is responsible for
acquiring audio signals from the DICIT microphones and performing all related pre-processing of
the audio before it is sent to the ASR module in the Speech and Dialog Management subsystem.
Figure 3 below shows the architecture of the Multichannel Acoustic Processing subsystem, which is
managed by PC 1.
Figure 3: Multichannel Acoustic Processing Subsystem
Audio signals from the microphone array, together with the TV audio channels and the TTS
synthesis output, are converted by an ADC (Analog-to-Digital Converter) module and acquired by a
multichannel soundcard. Signals may need to be pre-amplified by an external module. The
soundcard will also be connected to a DAC (Digital-to-Analog Converter) module to output the
processed TV signals to the loudspeakers. The soundcard and the conversion modules are
connected using the ADAT/Toslink protocol.
The audio processing of the Multichannel Acoustic Processing subsystem is performed by a
combination of software modules: Beamforming (BF), Speaker Localization (SLoc), Preprocessing
(PreProc), Two-Channel Acoustic Echo Cancellation (2C-AEC), and Smart Speech Filtering (SSF),
which includes Speech Activity Detection (SAD).
The software architecture and each module are described below.
2.2.1. Internal Software Architecture
A main program running within the Multichannel Acoustic Processing subsystem will take care of
the audio input/output and data processing. The various processing modules will be organized as
libraries that will be exchanged between the partners.
After an initial phase of setup and configuration, the main loop of the program consists of the
acquisition section followed by three modules that run sequentially. Parallel processing will
nevertheless be investigated where possible, using threads and a multi-core CPU. Figure 4 shows
the software structure of the Multichannel Acoustic Processing subsystem.
Each acquired input data frame is made available to the SLoc module, which processes it in
parallel. If SLoc runs slower than real time, the BF module will not wait for the SLoc output but
will use the previous results instead. In this way the system’s response time will not be affected.
Figure 4: Block structure of Multichannel Acoustic Processing
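The non-blocking hand-over between SLoc and BF described above can be sketched as follows. This is only an illustration under stated assumptions: the class `LatestValue` and the functions `sloc_worker` and `run_main_loop` are hypothetical names, the "azimuth" values are fake, and the actual DICIT main program is not specified at this level in the text.

```python
import threading
import time


class LatestValue:
    """Thread-safe holder for the most recent localization estimate."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._value = initial

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def sloc_worker(frames, latest, delay_s):
    """Hypothetical slow localizer: publishes one estimate per frame."""
    for index, _frame in enumerate(frames):
        time.sleep(delay_s)        # stands in for expensive processing
        latest.set(index * 10.0)   # fake azimuth in degrees


def run_main_loop(frames, latest):
    """Beamformer stand-in: never waits for SLoc, reads the last estimate."""
    used = []
    for _frame in frames:
        used.append(latest.get())  # returns immediately
    return used
```

The main loop only ever reads the holder, so a slow localizer delays the freshness of the estimate but not the audio path.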
2.2.2. Preprocessing (PreProc)
The two channels of the stereo signal, which combine the stereo signal from the TV with the
TTS output from the Speech and Dialog Management subsystem, are in general highly correlated.
They therefore cannot be fed to the loudspeakers directly but must first be decorrelated, so
that the 2C-AEC unit can identify each echo path unambiguously. This ensures robustness of the
2C-AEC against changing acoustic environments and time-variant beamforming. The decorrelation
is performed by the preprocessing unit, which directs its output signal both to the
loudspeakers and to the 2C-AEC unit. The objective is to keep the signal manipulations
inaudible while enabling fast convergence of the 2C-AEC filters.
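The text does not say which decorrelation method the PreProc unit uses. As one possible illustration, the sketch below applies a well-known nonlinear approach from the stereo echo cancellation literature, in which a scaled half-wave rectified copy of each channel is added to that channel. The function name `decorrelate_stereo` and the value of `alpha` are assumptions chosen for the example.

```python
def decorrelate_stereo(left, right, alpha=0.5):
    """Add a half-wave rectified component to each channel.

    The left channel receives its positive half-wave, the right channel
    its negative half-wave, so the two distortions differ between channels.
    alpha controls the strength of the distortion (illustrative value).
    """
    out_left = [x + alpha * (x + abs(x)) / 2.0 for x in left]
    out_right = [x + alpha * (x - abs(x)) / 2.0 for x in right]
    return out_left, out_right
```

For identical inputs `[1.0, -1.0]` on both channels, the outputs are `[1.5, -1.0]` and `[1.0, -1.5]`, so the channels are no longer identical. In practice `alpha` would have to be tuned so that the distortion stays inaudible.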
2.2.3. Beamforming (BF)
The beamforming unit exploits the spatial distribution of sources and interferers in order to
attenuate the latter. By means of filtering and adding up the n microphone signals, a beam with an
increased sensitivity in the 'look direction' is formed. This 'look direction' depends on the
information of the source localization unit, so that it is possible to track movements of the source.
The generated output signal is directed to the 2C-AEC block.
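As a rough illustration of forming a beam by combining the microphone signals, the sketch below implements delay-and-sum, the simplest special case of filter-and-sum in which each filter is a pure integer delay. It is not the DICIT beamformer; the function name and the integer steering delays are assumptions.

```python
def delay_and_sum(channels, delays):
    """Align each channel by an integer sample delay and average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   per-channel lag in samples (>= 0) with which the wanted
              source reaches that microphone; a lagging channel is
              advanced by this amount before summation.
    """
    length = len(channels[0])
    output = []
    for n in range(length):
        total = 0.0
        for samples, delay in zip(channels, delays):
            idx = n + delay
            if idx < length:       # samples past the end count as zero
                total += samples[idx]
        output.append(total / len(channels))
    return output
```

With a pulse that reaches the second microphone one sample late, steering with delays `[0, 1]` adds the two pulses coherently, while delays `[0, 0]` smear them over two samples at half amplitude.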
2.2.4. Two-Channel Acoustic Echo Cancellation (2C-AEC)
Two-channel acoustic echo cancellation uses the information from the available loudspeaker signals
to suppress the residual acoustic feedback from the loudspeakers within the beamformer output
signal. The 2C-AEC filters therefore have to adaptively model the cascade of the
loudspeaker-enclosure-microphone system and the subsequent time-varying BF system. For this
reason, the output of the PreProc unit must be fed into the 2C-AEC unit in addition to the BF
output. The output of the echo canceller should no longer contain any echo and serves as the
main input for the SSF unit.
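The adaptive modelling described above can be illustrated with a two-channel normalized LMS (NLMS) filter. NLMS is used here only because it is a common textbook algorithm; the text does not name the adaptation algorithm used in DICIT. The function names, filter length, step size and the toy echo path in `simulate_echo` are all assumptions.

```python
import random


def two_channel_nlms(ref_left, ref_right, mic, taps=4, mu=0.5, eps=1e-8):
    """Cancel the echo of two reference signals from one microphone signal.

    Returns the residual signal and the two adapted filters.
    """
    w_left = [0.0] * taps
    w_right = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # Most recent reference samples, newest first, zero before the start.
        x_left = [ref_left[n - k] if n >= k else 0.0 for k in range(taps)]
        x_right = [ref_right[n - k] if n >= k else 0.0 for k in range(taps)]
        echo_estimate = (sum(w * x for w, x in zip(w_left, x_left))
                         + sum(w * x for w, x in zip(w_right, x_right)))
        error = mic[n] - echo_estimate
        # Normalize the step by the joint power of both reference vectors.
        power = sum(x * x for x in x_left + x_right) + eps
        step = mu * error / power
        w_left = [w + step * x for w, x in zip(w_left, x_left)]
        w_right = [w + step * x for w, x in zip(w_right, x_right)]
        residual.append(error)
    return residual, w_left, w_right


def simulate_echo(length, seed=0):
    """Toy echo path: mic[n] = 0.5 * left[n] + 0.3 * right[n - 1]."""
    rng = random.Random(seed)
    left = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    right = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    mic = [0.5 * left[n] + (0.3 * right[n - 1] if n >= 1 else 0.0)
           for n in range(length)]
    return left, right, mic
```

The toy references are statistically independent, which is the favourable case. With strongly correlated loudspeaker channels the two filters are not uniquely determined, which is why the decorrelating PreProc stage precedes the loudspeakers.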
2.2.5. Speaker Localization (SLoc)
The speaker localization module is in charge of determining the position of the currently active
speaker. The module receives as input the signals acquired by the nested array. So far, the output
includes estimates of the azimuth and of the 3D coordinates, but it can easily be extended with
any other information that may be useful to the other modules, for instance the level of
plausibility of the estimate.
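To make the notion of an azimuth estimate concrete, the sketch below derives a direction from the time difference of arrival between two microphones, using plain cross-correlation and a far-field model. This is a generic illustration, not the SLoc algorithm (which the text does not detail); the function names, the two-microphone setup and the speed of sound value are assumptions.

```python
import math


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a.

    The lag is chosen to maximize the cross-correlation between the signals.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(sig_a)):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += sig_a[n] * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def azimuth_from_delay(lag_samples, sample_rate_hz, mic_spacing_m,
                       speed_of_sound=343.0):
    """Far-field angle from broadside, in degrees, for one microphone pair."""
    path_difference = lag_samples * speed_of_sound / sample_rate_hz
    # Clamp to the physically possible range before taking the arcsine.
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

A single pair only yields an angle; estimating 3D coordinates, as mentioned above, would require combining several microphone pairs.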
2.2.6. Smart Speech Filtering (SSF)
The aim of this module is to process the continuous pre-processed audio stream and other
information coming from the SLoc and 2C-AEC modules in order to provide the ASR with speech
chunks to be processed and discard any non-speech event, including background noise and other
possible interferences.
After the classification process, the SSF sends the speech chunks to the ASR using RTP, the
standard internet protocol specifically designed for transmission of real-time data, as
described in Section 2.3.1.
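The idea of cutting a continuous stream into speech chunks can be sketched with a simple frame-energy threshold. The SSF module described above also uses information from the SLoc and 2C-AEC modules and a classification step, none of which is modelled here; the function names, frame length and threshold are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def speech_chunks(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of high-energy frames.

    A trailing partial frame is ignored.
    """
    chunks = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = frame_energy(frame) > threshold
        if active and start is None:
            start = i * frame_len                   # chunk begins
        elif not active and start is not None:
            chunks.append((start, i * frame_len))   # chunk ends
            start = None
    if start is not None:
        chunks.append((start, n_frames * frame_len))
    return chunks
```

For example, 4 silent samples followed by 8 loud samples and 4 silent samples, analysed with `frame_len=4`, yield the single chunk `(4, 12)`.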
2.3. Speech & Dialog Management Subsystem
The Speech & Dialog Management subsystem is responsible for automatic speech recognition
(ASR) processing on the audio input from the Multichannel Acoustic Processing subsystem, as well
as dialog management and response generation via text-to-speech (TTS). It is also responsible for
coordinating the multimodal aspects of the user interaction, allowing user input via handheld remote
control in addition to voice, and allowing visual output via TV on-screen-display (OSD).
Figure 5: Speech and Dialog Management subsystem
Figure 5 shows the components of the Speech and Dialog Management subsystem, along with the
interfaces to the other three subsystems of the DICIT prototype. Each of these components will be
described in further detail below.
The DICIT partners decided during the initial WP2 architecture meetings that a good choice for the
Speech and Dialog Management subsystem would be IBM’s Conversational Interaction Manager
Architecture (CIMA). The CIMA framework provides an extensible architecture which closely
matches the requirements for the DICIT prototypes, in particular:
• a built-in ASR component based on IBM’s Embedded ViaVoice technology
• a built-in TTS component based on IBM’s Embedded TTS technology
• a built-in Dialog Manager component based on the SCXML language
• natural language understanding (NLU) capability using statistical language models and
action classification
• an extensible component architecture allowing custom device components to be added for
content management and TV control
• support for multiple spoken languages
• trainable acoustic models for the ASR engine
• support for user profiles
A key advantage of using CIMA for the DICIT prototypes is that CIMA already provides
integration with ASR, NLU and TTS engines, thereby eliminating the need to re-integrate these
components from scratch.
2.3.1. Interface from Multichannel Acoustic Processing Subsystem
The interface from the Multichannel Acoustic Processing subsystem to the Speech and Dialog
Management subsystem allows the flow of acoustic information from the front-end components in
the DICIT architecture. For the first DICIT prototype, this interface will provide both an audio data
stream to the ASR component, as well as control information from the front-end such as speech
activity detection (SAD).
Since the first DICIT prototype will be distributed across two PC systems, the ASR acoustic input
interface will be based on a network protocol which allows acoustic information to be sent
from the Multichannel Acoustic Processing subsystem PC to the Speech and Dialog Management
subsystem PC. The network protocol for this interface will be RTP (Real-time Transport Protocol),
a standard internet protocol, typically carried over UDP, which is specifically designed for
transmission of real-time data.
The format of the audio input data received by the ASR component for the first DICIT prototype
will be as follows:
• single channel audio
• 16-bit PCM
• 16 kHz sampling rate
This format is compatible with both the front-end output from the Multichannel Acoustic
Processing subsystem and with the ASR input to the ViaVoice engine.
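As an illustration of how one frame in this audio format could be wrapped for RTP transport, the sketch below builds the 12-byte fixed RTP header followed by big-endian 16-bit samples. The 20 ms frame length, the dynamic payload type 96 and the function name `build_rtp_packet` are assumptions; the text does not specify packetization details.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE = 96        # assumption: a dynamic payload type
SAMPLE_RATE_HZ = 16000   # from the audio format above
FRAME_MS = 20            # assumption: 20 ms of audio per packet
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000


def build_rtp_packet(samples, sequence, timestamp, ssrc):
    """Pack one mono frame of signed 16-bit samples into an RTP packet."""
    if len(samples) != SAMPLES_PER_FRAME:
        raise ValueError("expected %d samples" % SAMPLES_PER_FRAME)
    first_byte = RTP_VERSION << 6        # no padding, extension or CSRC
    second_byte = PAYLOAD_TYPE & 0x7F    # marker bit cleared
    header = struct.pack("!BBHII", first_byte, second_byte,
                         sequence & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    # Linear PCM payload in network (big-endian) byte order.
    payload = struct.pack("!%dh" % len(samples), *samples)
    return header + payload
```

Each packet then carries 320 samples, so the RTP timestamp would advance by 320 per packet and the packet length is 12 + 640 bytes.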
2.3.2. ASR Engine
The ASR component in DICIT will be based on IBM’s Embedded ViaVoice (EVV) engine [2], an
HMM-based recognizer which supports both PC and embedded hardware platforms. The CIMA
framework provides built-in integration with the EVV engine, allowing recognized words to be
acted upon in various ways, depending on the vocabularies and actions defined within CIMA.
The EVV engine is a speaker-independent ASR engine which supports both small and large
vocabularies. A vocabulary can be either a finite state grammar (FSG) with a structured syntax, or
a statistical language model which allows free-form natural speech input.
For the DICIT STB prototypes, statistical language models will be used for the ASR vocabulary in
order to allow free-form voice input by users.
2.3.3. ASR Acoustic Models
The Embedded ViaVoice ASR engine supports acoustic models which can be trained or adapted to
particular acoustic environments. For DICIT, this allows the models to be tuned specifically for the
acoustic audio which will be generated by the DICIT Multichannel Acoustic Processing subsystem.
Adaptation of the EVV acoustic models is accomplished through collection of representative audio
adaptation data (typically several hours of recordings) and then adapting a base model to this new
set of data. The adaptation process is done using IBM’s acoustic model adaptation tools, which can
take a pre-recorded set of audio data as input, and generate a new adapted acoustic model as output.
For the first DICIT prototype, the audio adaptation data will include contaminated speech data
derived from a clean speech corpus. This will provide a larger amount of adaptation data than
would otherwise be available.
Since the first DICIT prototype is expected to support three spoken languages (English, Italian,
German), the adaptation audio and acoustic model adaptation process will be done in all three
languages, and three different acoustic models will be generated.
2.3.4. ASR Language Models
The EVV ASR engine also supports statistical language models which are a key component of the
natural language understanding (NLU) capabilities of the CIMA framework. The statistical
language models allow free-form ASR which is not constrained according to pre-defined grammars.
This enhances the usability of the DICIT system by allowing natural input by voice, without
requiring users to memorize a fixed syntax for commands which are spoken.
Similar to the EVV acoustic models, the EVV ASR language models are trained to a particular
environment and application. This is accomplished by providing a set of representative language
training data in textual form, which is processed using IBM’s language model training tools.
As with the acoustic model training in the DICIT prototype, the language model training will be
done on three sets of language training data in order to support the expected languages (English,
Italian and German).
2.3.5. NLU Engine
The natural language understanding component in the DICIT prototype will use the NLU engine
which is part of the IBM CIMA framework. This engine is based on Action Classification
technology, which is able to map arbitrary recognized phrases from the ASR engine into categories
of actions which are appropriate for each phrase.
2.3.6. NLU Models
The models used by the CIMA NLU engine are statistically based and are trained for a particular
application and domain of interaction. Similar to the ASR language models, the NLU models are
trained on a set of representative text training data. This text training data is first annotated (using
an IBM text annotation tool) in order to provide the classification information for words in the text.
Once the training data is annotated then an NLU model can be generated from that data using an
IBM model generation tool.
Since the NLU models are language dependent, it is expected there will be three sets of models
generated for the DICIT prototypes in order to support the planned languages (English, Italian,
German). However, for the first DICIT prototype at month 18, it may be the case that there is
insufficient language data for training accurate Italian and German NLU models. In that case, those
languages may use a grammar-based ASR system for the first prototype instead of a statistical NLU
system, while the English first prototype will use NLU.
2.3.7. Dialog Manager
The dialog manager component in the DICIT prototype will be based on the CIMA framework
dialog manager, which manages all interactions with user input/output (including voice, haptic and
visual) and manages the interfaces to external data and devices (such as the Electronic Program
Guide (EPG) database and the TV and handheld remote devices).
2.3.8. CIMA Internal Architecture
Figure 6: CIMA Internal Architecture
Figure 6 shows the internal components of the CIMA framework. The Interaction Manager consists
of an Event Manager, Session Manager, Strategy Manager and Action Manager. The interaction
between the four managers is defined as one event loop of the Interaction Manager.
• The Event Manager creates CIMA events and manages the event queue.
• The Session Manager manages session contexts. It is responsible for finding a correct session
context for the current event. The session context contains all data of the strategy. The current
embedded CIMA implementation contains one session.
• The Strategy Manager is responsible for creating, maintaining, and destroying strategies. The
strategies can be understood as a set of rules for managing the conversation. The Strategy
Manager contains a Policy that decides which strategy should be used for handling the event. As
a result of the handling, the Strategy outputs an Action Request List.
• The Action Manager creates, maintains, and destroys actions. The manager starts and stops
appropriate actions based on the Action Request List. If the device generates an event, the
Action Manager may call the Event Manager to create and fire a new CIMA event.
The CIMA components use the shared Data Model for information interchange.
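One pass of the event loop through the four managers can be sketched as below. This is a toy model written only to illustrate the flow described above; it does not reflect the actual CIMA API, and every class, method and function name in it is hypothetical.

```python
from collections import deque


class EventManager:
    """Creates events and manages the event queue."""

    def __init__(self):
        self.queue = deque()

    def fire(self, name, payload=None):
        self.queue.append({"name": name, "payload": payload})

    def next_event(self):
        return self.queue.popleft() if self.queue else None


class SessionManager:
    """Holds a single session context, as in the embedded implementation."""

    def __init__(self):
        self.session = {"history": []}

    def session_for(self, event):
        return self.session


class StrategyManager:
    """Toy policy: choose the strategy registered under the event name."""

    def __init__(self, strategies):
        self.strategies = strategies

    def handle(self, event, session):
        strategy = self.strategies.get(event["name"])
        return strategy(event, session) if strategy else []


class ActionManager:
    """Runs each (action name, argument) pair of an action request list."""

    def __init__(self, devices):
        self.devices = devices

    def run(self, requests):
        return [self.devices[name](arg) for name, arg in requests]


def echo_strategy(event, session):
    """Example strategy: ask the TTS action to repeat the spoken input."""
    return [("tts", "You said: " + event["payload"])]


def run_event_loop_once(events, sessions, strategies, actions):
    event = events.next_event()
    if event is None:
        return []
    session = sessions.session_for(event)
    session["history"].append(event["name"])
    requests = strategies.handle(event, session)
    return actions.run(requests)
```

Firing a `"speech"` event with payload `"channel five"` and registering `echo_strategy` for it produces one `"tts"` action request, which the action manager hands to whatever callable is registered as the `"tts"` device.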
Strategies
A Strategy is an algorithm that analyzes an input Event, and produces an Action Request List.
Generally, strategies can be implemented without any limitations on the programming language.
CIMA applications are implemented as a state machine that defines the dialog flow in SCXML
(State Chart XML) format and uses ECMAScript code to perform executive calls.
Interfacing with Devices
The CIMA framework can interface with external devices via the action interface. The action
interface specifies an ECMA data structure for controlling the device and for receiving events from
the device. The driver for the external device then implements the action interface. CIMA has built-
in action interfaces for ASR, NLU and TTS engines.
2.3.9. TTS Engine
The text-to-speech (TTS) component in the DICIT prototype will use the IBM Embedded TTS
engine which is part of the CIMA framework. This TTS engine is based on concatenative speech
technology and supports both PC and embedded hardware platforms. The engine is available in
several different languages, including the three languages planned for the first DICIT prototype
(English, Italian, German). For each language, the engine supports different male and female
voices, allowing for some degree of customization for a particular application or listener.
The audio output from the TTS component in DICIT is routed both to the sound system of the TV
(for prompt playback to the user) and to the input of the Multichannel Acoustic Processing
subsystem (for use by the Two-channel Acoustic Echo Cancellation).
2.4. Content Management Subsystem
The Content Management Subsystem will mainly provide storage for two kinds of semi-permanent
data sets: the Electronic Program Guide (EPG) and the User Profiles.
The EPG will be downloaded from the Internet in the XMLTV [3] format. The system
needs Internet access, although the connection does not have to be continuous. If no up-to-date
EPG can be found, the DICIT functions related to the EPG will not be available (empty EPG).
Due to the size of the EPG data set, the storage will be based upon a Relational Database
Management System (RDBMS), such as the open-source projects MySQL or
PostgreSQL.
The User Profiles are intended to store various user preferences such as favorite TV programs,
preferred TV channels for news and entertainment, and other custom settings.
The User Profile system is smaller and simpler than the EPG system and does not necessarily need
to be implemented using an RDBMS. However, for the sake of uniformity, the specification for both
of these subsystems will be given in the form of an Entity-Relationship Information Model.
2.4.1. Electronic Program Guide
For the EPG data set, the RDBMS will act as the link between the “producers” of the EPG
data, which are external to the DICIT system (e.g. servers on the Internet), and the “consumer” of
such data, which is the DICIT dialog system.
The dialog system will use such data in a number of cases, the most important of which are:
1. the construction of a “permanent” binding among the list of channels known to the STB
(accessed via their number) and a list of channels mentioned within the EPG
2. the construction of program lists to be shown in the EPG listings
3. the construction of dynamic grammar for the speech recognition in the various cases of
search (e.g. by channel name, by actor, ….)
The EPG will be periodically downloaded from the Internet according to the XMLTV syntax.
As an example update policy, the EPG could be refreshed once a week, each time
downloading the EPG data for the following week. In this case, the DBMS would at any time store
EPG data for two weeks, the current one and the next one; a rough estimate of the size of such
a data set is in the range of tens of megabytes for a reasonably large EPG.
If the CMS detects that its data are not up to date, for example because the
system has been disconnected from the Internet, it will download fresh data as soon as possible.
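The example update policy and the staleness check can be sketched as follows. This is only an illustration under stated assumptions: the function names, the one-day look-ahead margin, and the choice of Monday 00:00 as the start of a week are not specified in this document.

```python
from datetime import datetime, timedelta


def refresh_window(now: datetime) -> tuple:
    """Return (start, end) covering the current week and the next one.

    Assumption: a week starts on Monday at 00:00.
    """
    monday = now - timedelta(days=now.weekday())
    start = monday.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(weeks=2)


def epg_is_stale(latest_stop: datetime, now: datetime,
                 min_lookahead: timedelta = timedelta(days=1)) -> bool:
    """Return True if the stored EPG does not reach min_lookahead past now.

    latest_stop is the latest programme stop time held in the database.
    """
    return latest_stop < now + min_lookahead
```

A content manager built along these lines could call `epg_is_stale` whenever Internet access becomes available and, if it returns True, download data for the window returned by `refresh_window`.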
The more significant parts of the XMLTV DTD are reported below (note that not all
elements and attributes are listed, only those most important for DICIT's purposes).
<!ELEMENT tv (channel*, programme*)>
<!ELEMENT channel (display-name+, icon*, url*) >
<!ATTLIST channel id CDATA #REQUIRED >
<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
category*, language?, orig-language?, length?,
icon*, url*, country*, episode-num*, video?, audio?,
previously-shown?, premiere?, last-chance?, new?,
subtitles*, rating*, star-rating? )>
<!ATTLIST programme start CDATA #REQUIRED
stop CDATA #IMPLIED
pdc-start CDATA #IMPLIED
vps-start CDATA #IMPLIED
showview CDATA #IMPLIED
videoplus CDATA #IMPLIED
channel CDATA #REQUIRED
clumpidx CDATA "0/1" >
<!ELEMENT credits (director*, actor*, writer*, adapter*, producer*,
presenter*, commentator*, guest* )>
<!ELEMENT director (#PCDATA)>
<!ELEMENT actor (#PCDATA)>
<!ELEMENT writer (#PCDATA)>
<!ELEMENT adapter (#PCDATA)>
<!ELEMENT producer (#PCDATA)>
<!ELEMENT presenter (#PCDATA)>
<!ELEMENT commentator (#PCDATA)>
<!ELEMENT guest (#PCDATA)>
Figure 7: DTD Fragment for XMLTV
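As an illustration of how a consumer might read data structured according to the DTD fragment above, the following Python sketch parses a small XMLTV document with the standard library. The sample channel, programme and names are invented for the example, and the dictionary layout returned by `parse_xmltv` is an assumption, not the DICIT data model.

```python
import xml.etree.ElementTree as ET

# Invented sample data that follows the DTD fragment above.
SAMPLE = """<tv>
  <channel id="ch1.example">
    <display-name>Channel One</display-name>
  </channel>
  <programme start="20071005200000 +0200" stop="20071005213000 +0200"
             channel="ch1.example">
    <title>Evening Film</title>
    <credits>
      <director>A. Director</director>
      <actor>B. Actor</actor>
    </credits>
  </programme>
</tv>"""


def parse_xmltv(text):
    """Return (channels, programmes) extracted from an XMLTV string."""
    root = ET.fromstring(text)
    # Map the required channel id to its first display name.
    channels = {c.get("id"): c.findtext("display-name")
                for c in root.findall("channel")}
    programmes = []
    for p in root.findall("programme"):
        credits = p.find("credits")  # optional element
        programmes.append({
            "channel": p.get("channel"),   # required attribute
            "start": p.get("start"),       # required attribute
            "stop": p.get("stop"),         # optional, may be None
            "title": p.findtext("title"),
            "actors": ([a.text for a in credits.findall("actor")]
                       if credits is not None else []),
        })
    return channels, programmes
```

The actor list extracted here is the kind of data that could feed a search by actor, one of the use cases listed earlier in this section.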
The above DTD fragment describes three related pieces of information: Channel,
Programme, and Credits. A possible E/R model of the EPG data extracted from the XMLTV data is as
follows. The model covers the needs of the dialogs designed for the first DICIT prototype.
Figure 8: E/R model of EPG data
2.4.2. User Profiles
The User Profiles will keep information related to the rights, preferences and tracked behaviour of
the users of the DICIT system. A possible E/R model of the User Profile is as follows. The model
exceeds the needs of the dialogs designed for the first DICIT prototype.
Figure 9: E/R model of User Profile
For DICIT, the CIMA framework will provide built-in support for user profiles. The selection and
alteration of the profile data will be under the control of the CIMA dialog manager. When a user
identifies himself or herself to the system (through a username or PIN), then the profile for that user
will be activated by the dialog manager.
2.4.3. Design considerations
The two data sets will not be related to each other by explicit relationships (indexes), so that
they can be updated separately, even if at a slightly higher computational cost.
For example, although the field “channel name” of table “blocked_preferred_programs” clearly has
the same domain as the field “name” of table “channels”, it will not be modelled as a
relationship (e.g. stored as an index) but will instead be stored as the full channel name.
This will keep the User Profile consistent after reloads of the EPG section, when the index of the
same channel can change over time.
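The effect of this design choice can be demonstrated with a small in-memory SQLite sketch. The table names and the channel name fields come from the text above; the `user_name` column, the channel names and the helper functions are hypothetical, and SQLite is used here only because it ships with Python, not because it is the RDBMS planned for DICIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE channels (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE blocked_preferred_programs (user_name TEXT, channel_name TEXT);
""")


def load_epg(names):
    """Replace the EPG channel table; row ids may differ after each reload."""
    conn.execute("DELETE FROM channels")
    conn.executemany("INSERT INTO channels (name) VALUES (?)",
                     [(n,) for n in names])


def preferred_channel_ids(user_name):
    """Resolve the stored channel names against the current EPG by name."""
    rows = conn.execute(
        "SELECT c.id FROM blocked_preferred_programs p "
        "JOIN channels c ON c.name = p.channel_name "
        "WHERE p.user_name = ?", (user_name,))
    return [r[0] for r in rows]


load_epg(["News 24", "Movie Max"])
conn.execute("INSERT INTO blocked_preferred_programs VALUES ('anna', 'Movie Max')")
before = preferred_channel_ids("anna")

# Reload the EPG with the channels in a different order: the id changes,
# but the profile row still resolves because it stores the name.
load_epg(["Movie Max", "Kids TV", "News 24"])
after = preferred_channel_ids("anna")
```

The join on the name is the "slightly higher computational cost" mentioned above: a text comparison replaces an integer key lookup.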
2.5. TV Control Subsystem
The TV Control Subsystem is responsible for decoding and displaying the TV program as well as
rendering Teletext and the on-screen display (OSD) output, which is the visual output of the
entire system. It consists of a Set Top Box (STB) controlled by the Dialog Manager by means
of an RS-232 serial connection to PC 2.
Figure 10: TV Control Architecture
2.5.1. STB
The STB component decodes a live free-to-air (not scrambled) TV program from a satellite signal.
Programs are normally composed of one video stream and one or more audio streams, as well as
Teletext data. The STB is able to play back a program and lets the audio source be selected.
It automatically restores the current program after a loss of satellite signal.
It has an F-connector for SAT input and is able to control multiple dishes by means of DiSEqC
signalling. This means that the same STB is able to receive programs from more than one satellite.
The outputs of the STB are 3 RCA connectors: one for composite video output and two for
analogue stereo audio. For the first DICIT prototype digital audio is not available.
The STB also has 2 LEDs for activity signalling: one is green and the other is red. Their behaviour
is controlled by the PC.
The STB does not really switch off but just disables video output (shows a black page) when not
active.
The Electronic Program Guide (EPG) is not handled by the STB because the EPG information
embedded in the satellite signal is not always available and is not complete.
For the first DICIT prototype the program list is not obtained from the live satellite signal but is
embedded in the system. Teletext output is also not available as an OSD page in the first DICIT
prototype. Teletext support works fully, as in other STB products, via the TV data
information. This means that Teletext cannot be controlled by the dialog manager, but only by
means of the remote control of the TV.
The STB firmware is upgradeable via a serial connection.
2.5.2. Remote Control
The remote control will be a universal model that can be programmed to emulate other remotes. It
will be programmed to emulate the remote of a digital satellite receiver. All of its keys are
available for use except for the following, which are not transmitted: 1-, 2-, MOV, VIEW, F1, F2,
F3, F4, BACK, RAD, LANG, LNB, H/V, 22K.
The software interface from the remote control to PC 2 will be WinLIRC, a standard open-source
interface which allows Windows-based PCs to receive signals from infrared remotes. The received
signals will be sent by WinLIRC to the CIMA dialog manager, which is also running on PC 2,
allowing the dialog manager to respond to keys on the remote in addition to voice commands.
Figure 11: Remote Control for the first DICIT prototype
2.5.3. STB Interface API
The STB interface consists of a DLL for Windows that communicates with the STB via RS-232.
The actual serial communication speed is 115200 baud.
The DLL is responsible for correct communication with the STB. It translates each API command
into a sequence of commands compatible with the STB firmware.
The library does not allocate memory for variables; this must be done by the calling
program.
The list of exported functions is detailed in Appendix B: STB Interface API Functions.
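The caller-allocates convention can be illustrated with Python's `ctypes` module. Because the DLL itself is not available here, the sketch uses a pure-Python stand-in for `GetSTBAudioVolume`, whose parameter list (three `BYTE` pointers) is taken from Appendix B; the stand-in's returned values and the integer return code (0 for success) are assumptions, since the document does not state the return type.

```python
import ctypes


def fake_get_stb_audio_volume(volume, vmin, vmax):
    """Stand-in for the DLL export GetSTBAudioVolume.

    It writes its results through the pointers it is given, as a C
    function with BYTE* parameters would. Values and return code are
    invented for the example.
    """
    volume[0], vmin[0], vmax[0] = 12, 0, 31
    return 0


def read_volume(get_volume=fake_get_stb_audio_volume):
    """Allocate the output variables in the caller and pass pointers to them."""
    volume = ctypes.c_ubyte()  # BYTE Volume, owned by the caller
    vmin = ctypes.c_ubyte()    # BYTE Min
    vmax = ctypes.c_ubyte()    # BYTE Max
    rc = get_volume(ctypes.pointer(volume), ctypes.pointer(vmin),
                    ctypes.pointer(vmax))
    if rc != 0:
        raise RuntimeError(f"STB call failed with code {rc}")
    return volume.value, vmin.value, vmax.value
```

With the real library, the stand-in would be replaced by a function object obtained from the loaded DLL, and the three variables would be passed in the same way.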
3. Tools and Standards for the STB prototype
3.1. Software Standards
This section describes some of the software standards which will be a key part of the DICIT
architecture.
3.1.1. SCXML
State Chart XML (SCXML) is an emerging W3C standard for expressing state machines [4], and is
becoming more common as a language for specifying dialog control flow in speech dialog
managers. It is an XML-based declarative language which supports specification of both simple
and complex state machines.
In the DICIT prototypes, SCXML will be used as the dialog manager control language for the
CIMA dialog manager. The IBM CIMA Development Tool (described in Section 3.3) will produce
SCXML output from authored visual dialog diagrams, and the generated SCXML will then be
interpreted by the CIMA dialog manager during execution of the DICIT prototypes.
Figure 12 shows a sample state diagram (for a coffee maker in this case) containing several states
and transitions between states. This visual state diagram would typically be authored by a state
machine developer (or a voice dialog developer in the case of voice dialogs).
Figure 12: Sample SCXML State Diagram
Figure 13 below shows the actual SCXML code listing for the state diagram in Figure 12. Without
automatic conversion tools, a state machine author or dialog author would typically need to write
the SCXML code manually, in order to match the states and transitions represented in the state
diagram. However, through the use of SCXML development tools (such as the IBM CIMA
Development Tool or EB GUIDE Studio), the SCXML code can be automatically generated from
the visual diagram, thereby greatly improving the ease and speed for developing SCXML-based
applications.
Figure 13: Sample SCXML code listing
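To give a feel for how an SCXML document maps to states and transitions, the sketch below embeds a small SCXML fragment in Python and walks it with a toy interpreter. It is not the coffee maker listing of Figure 13: the state and event names are invented, the markup follows the attribute names of the published W3C SCXML syntax (`initial`, `event`, `target`), and the interpreter handles only flat states, unlike a real SCXML engine such as the CIMA dialog manager.

```python
import xml.etree.ElementTree as ET

NS = {"sc": "http://www.w3.org/2005/07/scxml"}

# Invented three-state dialog, for illustration only.
SCXML_DOC = """<scxml xmlns="http://www.w3.org/2005/07/scxml"
       version="1.0" initial="idle">
  <state id="idle">
    <transition event="user.speak" target="listening"/>
  </state>
  <state id="listening">
    <transition event="asr.result" target="responding"/>
    <transition event="asr.timeout" target="idle"/>
  </state>
  <state id="responding">
    <transition event="tts.done" target="idle"/>
  </state>
</scxml>"""


def load_machine(text):
    """Return (initial_state, {(state, event): target}) for flat states."""
    root = ET.fromstring(text)
    table = {}
    for state in root.findall("sc:state", NS):
        for t in state.findall("sc:transition", NS):
            table[(state.get("id"), t.get("event"))] = t.get("target")
    return root.get("initial"), table


def run(text, events):
    """Feed events to the machine and return the final state id."""
    state, table = load_machine(text)
    for ev in events:
        # Events without a matching transition are ignored.
        state = table.get((state, ev), state)
    return state
```

Each `<state>` element corresponds to a box in a state diagram and each `<transition>` to an arrow, which is why a visual editor can generate this markup mechanically.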
3.2. EB GUIDE Studio
EB GUIDE Studio is an authoring tool for the detailed specification of multimodal dialogs.
Multimodal HMI designs for graphical/haptic interfaces and for speech dialog systems (SDS) can be
specified in parallel in one tool, thanks to an integrated solution with intelligent editors.
[Figure: EB GUIDE Studio overview. Labels: illustrated state chart editor (including pop-up
handling) for UML state charts; speech editor (multimodal HMI, grammar editor) for prompts and
commands; Speech Dialog Designer; GUI Designer; Graphics Designer; Translator; Dialog Editor;
Dialog Simulator; Statechart-Editor; View Editor; Language Editor; Simulator; I18N Export; XML-DB;
Grammar; Dialog Flow; Views; Widgets; Menu Logic; Texts.]
The application appearance of all editors is user configurable.
Automated consistency checks through checker plugins make sure that properties are consistent
within and across modalities and are formally complete. It is also possible to reuse dialogue
parts specified in other projects by using the project merge functions.
Specifications conform to UML state charts. The different editors also support a general modeling
style by offering templates, (deep) history states, conditions, enter and exit actions, and
transition specifications, and, for speech, command and prompt states as well as SRGS grammars
(which can be easily translated to VoiceXML or JSGF).
It is also possible to generate multiple displays and to have multimodal support driven by
multiple state machines, e.g. EPG, remote control, speech input. The multiple and hierarchical
state machine support also includes a function to synchronize the different displays.
An open API for independent development of specialized plugins, checkers or other functions
allows project-dependent extras such as a Wizard-of-Oz environment. All data (dialog
flows as well as images, prompts and commands) is stored in a central XML repository.
Apart from the specification of multimodal dialogs, the tool can also simulate the entire
task flow for demonstration and usability checks, and properties, values or task flow can be
changed on the fly if the look, sound and feel of the specified dialog system needs to be modified.
It is also possible to have simulation interfaces to real hardware via TCP, USB/RS-232 or LVDS,
e.g. to connect the STB directly to the specified dialog for testing. For speech input and output
it is possible to use one of the supported state-of-the-art recognizers, such as IBM Embedded
ViaVoice.
In EB GUIDE Studio it is also possible to export the specified detailed dialog into other formats,
e.g. SCXML, which is supported by CIMA, or to generate code from the specification for direct use
on the target, e.g. the HMI target in the automotive domain.
[Figure: EB GUIDE Studio screenshots. Labels: “Select modern music on the air now”; Simulation;
Data pool; View Editor (views and widgets); Widget Editor (event).]
3.2.1. SCXML Export
Using the SCXML Export plugin, it is possible to export a GUIDE project to SCXML with CIMA
extensions. That way, it is possible to create and test a project in EB GUIDE Studio, create an
export, and then run it with the IBM CIMA dialog manager.
While the GUIDE state-chart model and SCXML/CIMA are quite similar, there still are some
differences. Some features of GUIDE are not supported by SCXML (e.g. deep history states) or
CIMA (e.g. history states) and can therefore not be used. Checkers can detect uses of these features
and issue a warning.
Moreover, an extension was added to GUIDE that allows creating more than one event for one
grammar. In regular GUIDE, only one event can be created for a grammar. But in CIMA projects,
one grammar creates different events (or events plus different conditions) depending on the result of
the grammar parse. Events for different return values can be specified with the SpeechEvents
plugin.
3.2.2. WOZ Plugins
For the WOZ recordings, a special set of plugins was developed. These plugins enabled
the wizard to remotely control the system during the recordings, to interact with the user by
means of TTS, and to record the session to a log file. For the WOZ recordings, an older version of
GUIDE (2.8) was used. So far, the WOZ plugins are not available for EB GUIDE Studio and would
need to be ported.
• The State Explorer plugin shows a list of states, and the wizard can jump to these states.
Since two parallel state machines were used (one for haptic and one for speech interaction), the
state is changed in both state machines.
• In the SD Command Simulation Frame plugin, the wizard can see the list of active
prompts and can trigger them by means of a double-click.
• The TTS Simulation Frame plugin offers a pre-defined list of TTS prompts, which are
stored in an XML file. The wizard can trigger these or can enter a TTS prompt manually.
• The Simulation Logger plugin creates a log file from the current session. For every
recording, a new log file can be created. These log files can then be used for the evaluation
of the system.
• Using the Control Panel plugin, the wizard can press the buttons from the remote control
(except for the numbers).
• The Recognizing plugin provides a button and a shortcut (F12) that can be used to tell the
user that the system recognized an input and is currently processing it. A TTS prompt is generated
and a small icon is shown on the screen for 3 seconds.
• The EPG plugin provides EPG functionality for the DICIT model. The back-end is a movie
database and the front-end a frame, which can be controlled (e.g. to start a query) either by the
wizard or by means of events.
3.3. IBM CIMA Development Tool
The IBM CIMA Development Tool is a dialog authoring tool for the IBM CIMA (Conversational
Interaction Manager Architecture) dialog manager. It allows speech dialog authors to create speech
applications using a visual dialog flow diagram containing dialog states and transitions, as shown in
Figure 14 below. These dialog flows can be transformed by the tool into SCXML markup
language, which can then be run on the IBM CIMA dialog manager runtime engine.
Figure 14: IBM CIMA Development Tool
The IBM CIMA Development Tool is an IDE (integrated development environment) based on the
Eclipse open development platform. The tool supports attaching of necessary resources (grammars,
prompts, multimodal content, etc.) to a speech application. It also supports creation of scripts
within various states of a speech application, in order to connect the application to external data and
devices.
A key part of the CIMA development environment is the CIMA templates, which contain
predefined scripts for commonly used functions. These templates allow more rapid development of
speech applications by eliminating the need for dialog authors to write these common functions
from scratch.
4. DICIT First Surveillance Prototype System Architecture
The DICIT architecture for the first surveillance prototype consists of a central unit which
communicates with intrusion detection sensors placed in the environment as well as output devices
such as sirens and control devices that allow the configuration and control of the system by the user
(e.g. keypads, card readers). There are several types of sensors which can be used in a stand-alone
or in a cooperative configuration, for instance infrared volumetric or double technology (IR and
microwave), magnetic and vibration.
The novelty for surveillance is a new intrusion detection sensor based on Acoustic Event Detection
(AED), implemented on a PC that will be connected to the existing Fracarro surveillance system via
a serial communication line. This will allow the system to combine the alarm information provided
by the sensors with acoustic events, which can reduce the occurrence of false alarms.
The figure below shows the architecture of the first DICIT surveillance prototype.
Figure 15: Device setup for the first DICIT surveillance prototype
The AED monitors the environment acoustically by means of a distributed microphone network
placed in the rooms of the house. The AED system communicates any detected
acoustic event to the central unit, along with its time and type. The central unit combines the
information coming from the AED and the other sensors to decide whether to activate or deactivate
the alarm.
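This document does not specify the rule by which the central unit combines sensor and acoustic information. As one possible illustration, the sketch below confirms a sensor alarm only if a suspicious acoustic event was reported close to it in time. The event type names, the 5-second window and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AcousticEvent:
    time_s: float  # time of the event, in seconds
    kind: str      # event type reported by the classifier


SUSPICIOUS_KINDS = {"glass_break", "scream"}  # hypothetical event types
WINDOW_S = 5.0                                # hypothetical coincidence window


def confirm_alarm(sensor_time_s: float,
                  acoustic_events: Iterable[AcousticEvent],
                  window_s: float = WINDOW_S) -> bool:
    """Return True if a suspicious acoustic event lies within the window."""
    return any(
        e.kind in SUSPICIOUS_KINDS and abs(e.time_s - sensor_time_s) <= window_s
        for e in acoustic_events
    )
```

A rule of this kind uses the time and type information that the AED sends with each event; a sensor trigger with no supporting acoustic evidence could then be treated as a likely false alarm.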
4.1. Central Unit
The surveillance system will be composed of 4 base lines (sensors), but more lines are connectable
to the central unit via a module communicating with it through a serial link. The lines are
configurable to generate different kinds of alarms based on the sensor information: instantaneous
alarm, delayed alarm, emergency, anti-robbery, technological, medical alarm, tamper alarm, and
key (used to control the system). The system lines can be grouped into 8 different partitions that can
overlap each other. That makes it easy to activate partial protection of the system. For example a
night partition can be made of sensors on windows and external doors and exclude volumetric
indoor sensors.
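Overlapping partitions can be modelled as sets of lines, as in the following sketch. The partition and line names are invented; only the limit of 8 partitions and the night partition example come from the text above.

```python
MAX_PARTITIONS = 8  # limit stated for the central unit

# Hypothetical configuration: two overlapping partitions over four lines.
PARTITIONS = {
    "night": {"window_1", "window_2", "front_door"},
    "full": {"window_1", "window_2", "front_door", "indoor_volumetric"},
}
assert len(PARTITIONS) <= MAX_PARTITIONS


def armed_lines(active_partitions):
    """Return the union of the lines belonging to the active partitions."""
    lines = set()
    for name in active_partitions:
        lines |= PARTITIONS[name]
    return lines


def triggers_alarm(line, active_partitions):
    """A line can raise an alarm only if an active partition contains it."""
    return line in armed_lines(active_partitions)
```

With only the night partition active, the indoor volumetric sensor is ignored, so occupants can move around inside while windows and external doors stay protected.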
Different code types are provided: 1 code for the system installer, 1 master code, and 8 user codes
with 2 priority levels, each associated with some of the available partitions.
The user can activate or deactivate one or more partitions through a menu-based interface on the
keypad or through the key reader. Every key is associated with a user and allows activation of
different partitions by means of partially predefined insertions of the key.
A serial bus to which up to 16 devices can be connected is available. Remote keypads and an
electronic key reader can be connected to this RS-485 bus.
The system has a memory which is able to store up to 100 events including the temporal
information (date and time).
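A bounded event memory of this kind can be sketched with a fixed-size queue. The document states only the capacity of 100 events with date and time; the behaviour when the memory is full (here, the oldest event is discarded) and the class and method names are assumptions.

```python
from collections import deque
from datetime import datetime, timedelta


class EventLog:
    CAPACITY = 100  # capacity stated for the central unit

    def __init__(self):
        # Assumption: when full, the oldest entry is dropped automatically.
        self._events = deque(maxlen=self.CAPACITY)

    def record(self, when: datetime, description: str) -> None:
        """Store an event together with its date and time."""
        self._events.append((when, description))

    def __len__(self) -> int:
        return len(self._events)

    def oldest(self):
        """Return the oldest (when, description) pair still stored."""
        return self._events[0]
```

For example, after recording 105 events the log still holds 100 entries, and the first five have been overwritten.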
A telephone dialer, with an optional speech module, can be connected to the system. In that case
it provides 8 communication channels (which can be configured to send voice alarms or digital
communications to surveillance centers) together with a list of 32 telephone numbers
that can be associated with the communication channels. The installation of the speech module
allows the user to be contacted by phone when an alarm has been detected, with a voice message
sent according to the type of alarm (anti-robbery, anti-intrusion, technological, emergency). It is
also possible to personalize the messages, the telephone numbers to be called and the number of
call repetitions.
4.2. Acoustic Event Detection and Classification
The AED first processes the signals acquired by a distributed microphone network through the
multichannel processing module. The acoustic event detection block then extracts from the
continuous audio stream those segments that contain a sound event and sends them to the acoustic
event classifier, whose aim is to identify the event type according to a predefined list of possible
events. Finally, the information about the classified event is sent to the central unit. In the first
surveillance prototype the AED will run on a PC which communicates with the central unit through
an asynchronous RS-485 serial line at 9600 baud.
[Block diagram: Distributed Microphone Network → Multichannel Processing → Acoustic Event Detection → Acoustic Event Classification → to the Central Unit]
Figure 16: Block diagram for the AED.
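The detection step, which extracts from a continuous stream the segments that contain a sound event, can be illustrated with simple energy thresholding. This is a stand-in technique chosen for brevity, not the DICIT detector, whose algorithm is not described here; the function name, the per-frame energy input and the threshold are assumptions.

```python
def detect_segments(frame_energies, threshold):
    """Return (start, end) frame index pairs whose energy exceeds threshold.

    The end index is exclusive. Each returned segment is what a detector
    would hand to the classifier.
    """
    segments = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i                      # a segment begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # the segment ends
            start = None
    if start is not None:                  # stream ended inside a segment
        segments.append((start, len(frame_energies)))
    return segments
```

For instance, with frame energies `[0, 0, 5, 6, 0, 7]` and a threshold of 1, two segments are found: frames 2 to 3 and frame 5.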
5. Appendix A: Prototype Hardware Requirements
The DICIT hardware for the first prototype can be divided into three main blocks: PC 1, PC 2, and
external hardware. Each of these blocks is described in further detail in the sections below.
5.1. PC 1
PC 1 hosts the Multichannel Acoustic Processing subsystem and the audio interface. The PC should
be a powerful machine equipped with a multi-core CPU, so that
different software modules can run in parallel.
A suggested hardware configuration could be the following:
• Intel Quad Core Xeon E5320
• 4GB RAM
• 2 x 320GB SATA HD
An important section of the PC is the audio data recording hardware. The digital board could be the
RME HDSP 9652, connected via ADAT optical to three RME Octamic D acquisition boards. An
external output board connected via ADAT is required to play the de-correlated TV signals. A
suggested board is the MOTU 896HD. Figure 17 shows the hardware configuration of PC 1. The
number of channels to be recorded is 15 (from the nested array) + 2 (TV left and right channels) + 1
(synthesis). Later in the project, TV audio channels could be 5 + 1.
Figure 17: PC 1 audio acquisition chain
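The channel budget above can be checked with a few lines of arithmetic. The channel counts and the 24-bit sample size come from this appendix; the 48 kHz sample rate is an assumption made only to obtain a concrete data rate, as the recording rate is not stated here.

```python
OCTAMIC_BOARDS = 3        # three RME Octamic D boards
INPUTS_PER_OCTAMIC = 8    # 8 mic/line inputs per board


def recorded_channels(array_mics=15, tv_channels=2, tts_channels=1):
    """Total channels to record: array + TV audio + synthesis."""
    return array_mics + tv_channels + tts_channels


def data_rate_bytes_per_s(channels, sample_rate_hz=48_000, bytes_per_sample=3):
    """Raw PCM data rate; 48 kHz is assumed, 3 bytes = 24-bit samples."""
    return channels * sample_rate_hz * bytes_per_sample


first_prototype = recorded_channels()             # 15 + 2 + 1
later_5_plus_1 = recorded_channels(tv_channels=6)  # 15 + (5 + 1) + 1
mic_inputs_available = OCTAMIC_BOARDS * INPUTS_PER_OCTAMIC
```

Under these assumptions the first prototype records 18 channels at roughly 2.6 MB/s, and the 15 array microphones fit within the 24 inputs offered by the three Octamic boards.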
• The RME HDSP 9652 is a digital acquisition board that offers 3 ADAT optical I/O, ADAT-
sync In, SPDIF I/O, word clock I/O. It is a PCI board that supports the ALSA drivers.
• The RME Octamic D is an acquisition board with 8 balanced XLR mic/line inputs. Each
channel has switches for 48V phantom power, a low cut filter and phase reversal.
Amplification gain can be set between 10 and 60 dB. The ADC module provides 8 channels
at up to 192 kHz with 24-bit precision, available as double ADAT output (S/MUX, up
to 96 kHz) and simultaneously via DB-25 connectors as 4 AES outputs (up to 192 kHz).
The ADC can be clocked internally (master) or externally via word clock and AES sync.
• The MOTU 896HD is an acquisition board with 8 microphone preamplifiers, pristine 192
kHz analog I/O, 8 channels of ADAT digital I/O, stereo AES/EBU. The precision is 24 bits.
5.2. PC 2
PC 2 hosts the Speech & Dialog Management subsystem, the Content Management subsystem and
the PC software interfaces for the TV Control subsystem. This PC does not require as much CPU,
memory or disk storage as PC number 1, and requires only a typical desktop configuration:
• single-core CPU 1.8 GHz or above
• 1 GB RAM or above
• 75 GB HD minimum
PC 2 also requires 2 RS-232 serial ports to allow communication to the external devices for the TV
Control subsystem. One serial port will be used for a control connection to the external Set Top
Box, to allow PC 2 to transmit API commands (listed in Appendix B) to the box. The other serial
port will be used for a connection to an IR receiver, to allow PC 2 to receive input commands from
the handheld Remote Control.
5.3. External hardware
Some hardware needs to be connected to the PCs of the prototype.
• Microphone Array:
On the input side, connected to the three Octamic boards, there will be an array of
microphones. This array will be composed of 15 low-cost electret microphones in a nested
configuration. Figure 18 shows the layout of the array (all distances shown are in
centimeters).
Figure 18: The nested microphone array for audio acquisition
• LCD television:
An LCD television will be connected to the PC. The size of the display should suit
the room, but at least a 32” TV is recommended. A suggested model is the Sharp Aquos 46”.
• Set Top Box:
A modified version of a commercial STB made available by Fracarro will be connected to a
satellite dish to receive DVB-S free to air digital programs. The STB is based on the ST5105
platform.
• Remote Control:
The infrared remote control will be a programmable universal model, “Planet Alias 1”. It
will be programmed by Fracarro as a digital satellite receiver remote, compatible with the
STB system.
• Surround system:
The TV and PC 2 will be connected to a 5 + 1 audio surround system. For the first prototype
only two stereo channels will be used. A suggested system is the Genelec 8020A
plus the Genelec 7050B subwoofer.
6. Appendix B: STB Interface API Functions
Functions for Program Discovery and Selection
• GetChannelScanStatus (ScanStatusEnum_t *status, WORD *ProgramFound)
Get the STB Program Scan Status.
• GetCurrentProgram (WORD *program_no)
Retrieve the Program the STB is showing.
• GetProgramAudio (BYTE *Index)
Get the audio index for the current program.
• GetSatParameter (BYTE SatNo, SATELLITE_STRUCT *SatConfig)
Get the satellite configuration parameters.
• ProgramScan (SATELLITE_STRUCT *Satellite, BYTE Mode, BOOL Blocking)
Perform a STB Program Scan.
• QueryProgramlList (WORD *program_no, PROGRAM_STRUCT *programs)
Get the Program list of the STB.
• SetCurrentProgram (WORD program_no)
Set the STB Program.
• SetProgramAudio (BYTE Index)
Set the audio index for the current program.
• SetSatParameter (BYTE SatNo, SATELLITE_STRUCT *SatConfig)
Set the parameters for the satellite configuration.
• StopProgramScan ()
Stop a non-blocking STB Program Scan.
General Purpose Functions
• GetSTBPower (BYTE *OnOff)
Get the Power state of the STB.
• GetSTBStatus (StatusEnum_t *status)
Get the STB Status.
• GetSTBSWVersion (CHAR *SWVer)
Get the SW version of the STB.
• RestartSTB ()
Perform a software reset for the STB.
• SetBlankPage (BOOL OnOff, DWORD Color)
Set a blank or colored screen.
• SetRadioPage (BOOL OnOff)
Set a radio image on the screen.
• SetSTBPower (BYTE OnOff)
Set the Power state of the STB
• SetSTBSerialPort (CHAR *SerialPort)
Set the PC serial port used to communicate with the STB.
Functions for STB Audio Volume Control
• GetSTBAudioMute (BOOL *OnOff)
Get the mute state.
• GetSTBAudioVolume (BYTE *Volume, BYTE *Min, BYTE *Max)
Get the STB Audio Volume and volume limits.
• SetSTBAudioMute (BOOL OnOff)
Set the mute state.
• SetSTBAudioVolume (BYTE Volume)
Set the STB Audio Volume.
Functions for STB OSD
• OSDClearScreen ()
Clear the OSD screen and show video output.
• OSDRefresh ()
Refresh the OSD screen after a change of strings or layout.
• OSDSetLayout (BYTE Layout)
Set the OSD layout.
• OSDSetText (ROW_STRUCT *OSDText, BYTE Struct_no)
Set the text for the main OSD Area.
• OSDShowBitmap (BYTE Bitmap_no, WORD X, WORD Y, BYTE Transparence)
Show a Bitmap.
• OSDShowPopup (BYTE Rows, ROW_STRUCT *PopupText, , BYTE Struct_no, WORD Timeout)
Show Popup Message.
• OSDClearPopup ()
Hide a popup message.
• OSDShowScrollbar (BYTE Value, BYTE Min, BYTE Max)
Show the scrollbar.
• OSDShowVolumeGauge (BYTE Volume, BYTE Min, BYTE Max, WORD Timeout)
Show the volume gauge for the specified time.
7. References
[1] Distant Talking Interfaces for Control of Interactive TV
Annex I - Description of Work 31-May-2006
[2] IBM Embedded ViaVoice Developer’s Guide
June 2007
IBM Corporation
[3] XMLTV
Homepage: http://xmltv.org
Project: http://sourceforge.net/projects/xmltv/
[4] State Chart XML (SCXML)
W3C Working Draft
http://www.w3.org/TR/scxml/