
Deliverable 2.1

DICIT Architecture

Tools, Standards, Hardware and Software for the First Prototypes

Authors: Gregg Daggett

Affiliations: IBM

Date: 5-Oct-2007

Document Type: R

Status/Version: 1.0

Dissemination Level: PU

FP6 IST-034624 http://dicit.itc.it


Project Reference: FP6 IST-034624

Project Acronym: DICIT

Project Full Title: Distant-talking Interfaces for Control of Interactive TV

Dissemination Level: PU

Contractual Date of Delivery: 31-Aug-2007

Actual Date of Delivery: 5-Oct-2007

Document Number: DICIT_D2.1_20070930

Type: Deliverable

Status & Version: 1.0

Number of Pages: 4+31

WP Contributing to the Deliverable: WP2 (WP responsible: Gregg Daggett – IBM)

WP Task responsible: Gregg Daggett (IBM)

Authors (Affiliation): Gregg Daggett (IBM)

Other Contributors: Nicole Beringer and Matthias Bezold (Elektrobit); Andrea Buson and Thomas Antonello (Fracarro); Alessio Brutti, Luca Cristoforetti, Maurizio Omologo and Christian Zieger (FBK-irst); Roberto Manione and Fiorenza Arisio (Amuser); Lutz Marquardt and Edwin Mabande (FAU)

Reviewer:

EC Project Officer: Erwin Valentini

Keywords: architecture, multi-microphone devices, distant-talking speech recognition devices, voice-operated devices, Interactive TV, anti-intrusion, surveillance.

Abstract:

The purpose of this document is to describe the architecture for the first DICIT prototype

system. This includes the hardware and software components and interfaces which will

comprise the prototype, as well as the software tools and standards which will be used during

the prototype development.


Contents

Contents
Index of Figures
1. Introduction
2. DICIT First STB Prototype System Architecture
2.1. Architecture Overview
2.2. Multichannel Acoustic Processing Subsystem
2.2.1. Internal Software Architecture
2.2.2. Preprocessing (PreProc)
2.2.3. Beamforming (BF)
2.2.4. Two-Channel Acoustic Echo Cancellation (2C-AEC)
2.2.5. Speaker Localization (SLoc)
2.2.6. Smart Speech Filtering (SSF)
2.3. Speech & Dialog Management Subsystem
2.3.1. Interface from Multichannel Acoustic Processing Subsystem
2.3.2. ASR Engine
2.3.3. ASR Acoustic Models
2.3.4. ASR Language Models
2.3.5. NLU Engine
2.3.6. NLU Models
2.3.7. Dialog Manager
2.3.8. CIMA Internal Architecture
2.3.9. TTS Engine
2.4. Content Management Subsystem
2.4.1. Electronic Program Guide
2.4.2. User Profiles
2.4.3. Design considerations
2.5. TV Control Subsystem
2.5.1. STB
2.5.2. Remote Control
2.5.3. STB Interface API
3. Tools and Standards for the STB prototype
3.1. Software Standards
3.1.1. SCXML
3.2. EB GUIDE Studio
3.2.1. SCXML Export
3.2.2. WOZ Plugins
3.3. IBM CIMA Development Tool
4. DICIT First Surveillance Prototype System Architecture
4.1. Central Unit
4.2. Acoustic Event Detection and Classification
5. Appendix A: Prototype Hardware Requirements
5.1. PC 1
5.2. PC 2
5.3. External hardware
6. Appendix B: STB Interface API Functions
7. References


Index of Figures

Figure 1: Device setup for the first DICIT prototype
Figure 2: Subsystems of the first DICIT prototype
Figure 3: Multichannel Acoustic Processing Subsystem
Figure 4: Block structure of Multichannel Acoustic Processing
Figure 5: Speech and Dialog Management subsystem
Figure 6: CIMA Internal Architecture
Figure 7: DTD Fragment for XMLTV
Figure 8: E/R model of EPG data
Figure 9: E/R model of User Profile
Figure 10: TV Control Architecture
Figure 11: Remote Control for the first DICIT prototype
Figure 12: Sample SCXML State Diagram
Figure 13: Sample SCXML code listing
Figure 14: IBM CIMA Development Tool
Figure 15: Device setup for the first DICIT surveillance prototype
Figure 16: Block diagram for the AED
Figure 17: PC 1 audio acquisition chain
Figure 18: The nested microphone array for audio acquisition


1. Introduction

The DICIT project requires the design and implementation of functional prototypes as part of the

deliverables for WP2, as outlined in the DICIT Technical Annex [1]. These prototypes will be

based upon new technologies developed in the other technical work packages of the project (WP3,

WP4, WP5). The first prototypes for DICIT (to be delivered midway through the project) will

implement a first version of a distant-talking TV interface incorporating a set-top-box (STB)

platform as well as a surveillance system for anti-intrusion. As the DICIT technologies are further

refined during the course of the project, a second STB-based TV interface prototype is expected

during the second half of the project.

This document describes the software and hardware architecture of the first DICIT prototypes for

the TV and surveillance scenarios, as jointly defined by the DICIT partners. The purpose of this

architecture definition is to facilitate the development, integration and debugging of the software

and hardware components which comprise the first DICIT prototypes.

Following this introductory section, Section 2 of this document contains the heart of the architecture

specification for the TV scenario. After an outline of the overall architecture and hardware layout in

subsection 2.1, the remainder of Section 2 provides details of each of the four subsystems of the

DICIT STB-based prototype.

Section 3 provides information regarding the software tools which will be used during the course of

development of the first DICIT STB-based prototype, as well as a description of the software

standards which will be employed.

Section 4 deals with the first surveillance prototype architecture, which will integrate acoustic event detection functionality into an already existing system.

Appendix A provides a complete listing of the hardware and devices which will be required for the

first DICIT prototypes, and Appendix B lists the functions which will be supported for the PC

interface controlling the STB device itself.

While this document is intended to provide a complete overall design of the hardware and software

aspects of the DICIT architecture, the design is neither exhaustive nor final. Since the prototyping

of many DICIT technology components is in progress, a more finely detailed description of the

internal aspects of these components will be contained in the technical deliverables for each of

WP3, WP4 and WP5.

It is expected that this document will remain a living document which will be extended during the

course of the project as the DICIT architecture matures. It will also be revised as part of the

architecture deliverable for the second STB-based and surveillance prototype later in the project.


2. DICIT First STB Prototype System Architecture

2.1. Architecture Overview

The architecture of the first DICIT STB prototype is divided into several subsystems which correspond to logical functions of the system. These subsystems also correlate with the various work packages (WPs) which comprise the technical work of the project.

The expected physical device setup for the first DICIT prototype is shown in the illustration below.

The main elements of the setup include:

• a microphone array and audio hardware front-end for acquiring spoken user input to the

system

• a first PC (PC 1) for handling multichannel acoustic processing of audio input

• a second PC (PC 2) for handling speech and dialog management processing, together with

management of TV EPG information and user profiles

• a set-top-box (STB) and display for TV program viewing

• a remote control for manual user input

• stereo speakers for TV audio and dialog voice output


Figure 1: Device setup for the first DICIT prototype


From a logical standpoint, the first DICIT prototype architecture is divided into the following four

subsystems:

• Multichannel Acoustic Processing Subsystem

• Speech & Dialog Management Subsystem

• Content Management Subsystem

• TV Control Subsystem

Figure 2: Subsystems of the first DICIT prototype

Each of the four subsystems is described in more detail in the following Sections 2.2 through 2.5.



2.2. Multichannel Acoustic Processing Subsystem

In the DICIT STB prototype, the Multichannel Acoustic Processing subsystem is responsible for

acquiring audio signals from the DICIT microphones and performing all related pre-processing of

the audio before it is sent to the ASR module in the Speech and Dialog Management subsystem.

Figure 3 below shows the architecture of the Multichannel Acoustic Processing subsystem, which is

managed by PC 1.


Figure 3: Multichannel Acoustic Processing Subsystem

Audio signals coming from the microphone array, together with the TV audio channels and the synthesized speech output, are converted by an ADC (Analog-to-Digital Converter) module and acquired by a multichannel soundcard. Signals may need to be pre-amplified by an external module. The soundcard will also be connected to a DAC (Digital-to-Analog Converter) module to output the processed TV signals to the loudspeakers. The connection between the soundcard and the conversion modules uses the ADAT/Toslink protocol.

The audio processing of the Multichannel Acoustic Processing subsystem is obtained by combining different software modules: Beamforming (BF), Source Localization (SLoc), Preprocessing (PreProc), Two-channel Acoustic Echo Cancellation (2C-AEC), and Smart Speech Filtering (SSF) including Speech Activity Detection (SAD).

The following subsections describe the software architecture and each of these modules.


2.2.1. Internal Software Architecture

A main program running within the Multichannel Acoustic Processing subsystem will take care of the audio input/output and data processing. The various processing modules will be organized as libraries that will be exchanged between the partners.

After an initial phase of setup and configuration, the main loop of the program is composed of the acquisition section and three modules that run sequentially. However, parallel processing will be investigated where possible, using threads and a multi-core CPU. Figure 4 shows the software structure of the Multichannel Acoustic Processing subsystem.

The acquired input data frame is made available to the SLoc module, which will process it in parallel. If SLoc runs slower than real time, the BF module will not wait for its output but will use the previous results instead. In this way the system's response time will not be affected.
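To make this stale-result fallback concrete, here is a minimal Python sketch of such a non-blocking arrangement; acquire_frame, localize and beamform are placeholders for the real DICIT modules, not their actual interfaces.

import threading
import queue

def acquire_frame():
    """Placeholder for the blocking multichannel audio read."""
    return [0.0] * 512

def localize(frame):
    """Placeholder for the SLoc algorithm; may run slower than real time."""
    return {"azimuth": 0.0}

def beamform(frame, direction):
    """Placeholder for the BF module, steered by the latest SLoc estimate."""
    return frame

class SLocWorker(threading.Thread):
    """Runs localization in parallel so the main loop never blocks on it."""

    def __init__(self):
        super().__init__(daemon=True)
        self.frames = queue.Queue(maxsize=1)  # only the newest frame matters
        self.latest = None                    # most recent localization result
        self.lock = threading.Lock()

    def run(self):
        while True:
            frame = self.frames.get()
            result = localize(frame)
            with self.lock:
                self.latest = result

    def submit(self, frame):
        try:
            self.frames.put_nowait(frame)     # drop the frame if SLoc is busy
        except queue.Full:
            pass

    def latest_result(self):
        with self.lock:
            return self.latest

sloc = SLocWorker()
sloc.start()
for _ in range(100):                          # stands in for the real-time loop
    frame = acquire_frame()
    sloc.submit(frame)
    direction = sloc.latest_result()          # may be one or more frames old
    output = beamform(frame, direction)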

Figure 4: Block structure of Multichannel Acoustic Processing

2.2.2. Preprocessing (PreProc)

The two channels of the stereo signal, which is a combination of the stereo signal from the TV and the TTS output from the Speech and Dialog Management subsystem, are in general highly correlated. They therefore cannot be fed into the loudspeakers directly, but must first be decorrelated in order to allow unambiguous identification of each echo path by the 2C-AEC unit; this ensures robustness of the 2C-AEC against changing acoustic environments and time-variant beamforming. The decorrelation is performed by the preprocessing unit, which directs its output signal both to the loudspeakers and to the 2C-AEC unit. The objective is to keep the signal manipulations inaudible while enabling fast convergence of the 2C-AEC filters.
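As an illustration, one widely used decorrelation technique for stereo AEC is the half-wave-rectifier nonlinearity of Morgan, Benesty and Sondhi; the NumPy sketch below shows that generic approach and is not necessarily the exact preprocessing used in DICIT.

import numpy as np

def decorrelate_stereo(x_left, x_right, alpha=0.3):
    # Add a small, oppositely signed half-wave nonlinearity to each channel.
    # This breaks the inter-channel correlation so both echo paths become
    # identifiable, while remaining largely inaudible for small alpha.
    y_left = x_left + alpha * np.maximum(x_left, 0.0)     # positive half-wave
    y_right = x_right + alpha * np.minimum(x_right, 0.0)  # negative half-wave
    return y_left, y_right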

2.2.3. Beamforming (BF)

The beamforming unit exploits the spatial distribution of sources and interferers in order to

attenuate the latter. By means of filtering and adding up the n microphone signals, a beam with an

increased sensitivity in the 'look direction' is formed. This 'look direction' depends on the

information of the source localization unit, so that it is possible to track movements of the source.

The generated output signal is directed to the 2C-AEC block.
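The simplest instance of this filter-and-sum principle is a delay-and-sum beamformer; the following NumPy sketch (integer delays, circular shifts for brevity) illustrates the idea rather than the actual DICIT implementation.

import numpy as np

def delay_and_sum(mics, delays):
    # mics: array of shape (n_channels, n_samples); delays: integer sample
    # delays that align the look direction across channels. A real system
    # would use fractional-delay filters instead of circular np.roll shifts.
    n_channels = mics.shape[0]
    aligned = [np.roll(mics[ch], -delays[ch]) for ch in range(n_channels)]
    return np.sum(aligned, axis=0) / n_channels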

2.2.4. Two-Channel Acoustic Echo Cancellation (2C-AEC)

Two-channel acoustic echo cancellation uses the information from the available loudspeaker signals to suppress the residual acoustic feedback from the loudspeakers within the beamformer output signal. The 2C-AEC filters therefore have to adaptively model the combination of the loudspeaker-enclosure-microphone system and the subsequent time-varying BF system. Besides the BF output, the output of the PreProc unit therefore also has to be fed into the 2C-AEC unit. The output of the echo canceller should ideally contain no more echo and serves as the major input for the SSF unit.
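A common way to realize such adaptive filters is the normalized LMS algorithm; the sketch below shows a generic two-channel NLMS echo canceller in NumPy, offered as an illustration rather than the DICIT algorithm itself.

import numpy as np

def nlms_2c(x1, x2, d, taps=256, mu=0.5, eps=1e-6):
    # x1, x2: the two (decorrelated) loudspeaker reference signals;
    # d: beamformer output containing residual echo; returns the residual e.
    w1 = np.zeros(taps)
    w2 = np.zeros(taps)
    e = np.zeros(len(d))
    for n in range(taps, len(d)):
        u1 = x1[n - taps:n][::-1]        # newest reference samples first
        u2 = x2[n - taps:n][::-1]
        y = w1 @ u1 + w2 @ u2            # estimated echo at time n
        e[n] = d[n] - y                  # ideally echo-free output for SSF
        norm = u1 @ u1 + u2 @ u2 + eps
        w1 += mu * e[n] * u1 / norm      # jointly normalized updates
        w2 += mu * e[n] * u2 / norm
    return e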

2.2.5. Speaker Localization (SLoc)

The speaker localization module is in charge of determining the position of the currently active speaker. The module receives as input the signals acquired by the nested array. So far, the output includes the azimuth and the estimated 3D coordinates, but it can easily be extended with any other information that may be useful to the other modules, for instance the plausibility level of the estimate.
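Localization modules of this kind typically start from time-difference-of-arrival estimates between microphone pairs; a standard method is GCC-PHAT, sketched below as background (the actual SLoc algorithm is described in the technical WP deliverables).

import numpy as np

def gcc_phat_delay(sig_a, sig_b, fs):
    # Whiten the cross-power spectrum (the PHAT weighting keeps only phase
    # information), transform back, and pick the peak of the correlation.
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds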

2.2.6. Smart Speech Filtering (SSF)

The aim of this module is to process the continuous pre-processed audio stream and other

information coming from the SLoc and 2C-AEC modules in order to provide the ASR with speech

chunks to be processed and discard any non-speech event, including background noise and other

possible interferences.

After the classification process, the SSF sends the speech chunks to the ASR using the RTP network protocol described in Section 2.3.1.
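As a point of reference, the simplest possible speech-activity gate just thresholds the frame energy; the toy sketch below illustrates that principle only, since the real SSF additionally exploits SLoc and 2C-AEC information to reject non-speech events.

import numpy as np

def energy_sad(frames, threshold_db=-40.0):
    # Marks a frame as speech when its RMS level exceeds a fixed threshold.
    flags = []
    for frame in frames:
        rms = np.sqrt(np.mean(np.square(frame))) + 1e-12
        flags.append(20.0 * np.log10(rms) > threshold_db)
    return flags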


2.3. Speech & Dialog Management Subsystem

The Speech & Dialog Management subsystem is responsible for automatic speech recognition

(ASR) processing on the audio input from the Multichannel Acoustic Processing subsystem, as well

as dialog management and response generation via text-to-speech (TTS). It is also responsible for

coordinating the multimodal aspects of the user interaction, allowing user input via handheld remote

control in addition to voice, and allowing visual output via TV on-screen-display (OSD).

Figure 5: Speech and Dialog Management subsystem

Figure 5 shows the components of the Speech and Dialog Management subsystem, along with the

interfaces to the other three subsystems of the DICIT prototype. Each of these components will be

described in further detail below.

The DICIT partners decided during the initial WP2 architecture meetings that a good choice for the

Speech and Dialog Management subsystem would be IBM’s Conversational Interaction Manager

Architecture (CIMA). The CIMA framework provides an extensible architecture which closely

matches the requirements for the DICIT prototypes, in particular:

• a built-in ASR component based on IBM’s Embedded ViaVoice technology

• a built-in TTS component based on IBM’s Embedded TTS technology


• a built-in Dialog Manager component based on the SCXML language

• natural language understanding (NLU) capability using statistical language models and

action classification

• an extensible component architecture allowing custom device components to be added for

content management and TV control

• support for multiple spoken languages

• trainable acoustic models for the ASR engine

• support for user profiles

A key advantage of using CIMA for the DICIT prototypes is that CIMA already provides integration with ASR, NLU and TTS engines, eliminating the need to integrate these components from scratch.

2.3.1. Interface from Multichannel Acoustic Processing Subsystem

The interface from the Multichannel Acoustic Processing subsystem to the Speech and Dialog

Management subsystem allows the flow of acoustic information from the front-end components in

the DICIT architecture. For the first DICIT prototype, this interface will provide both an audio data

stream to the ASR component, as well as control information from the front-end such as speech

activity detection (SAD).

Since the first DICIT prototype will be distributed across two PC systems, the ASR acoustic input

interface will be based on a network protocol which allows the sending of acoustic information

from the Multichannel Acoustic Processing subsystem PC to the Speech and Dialog Management

subsystem PC. The network protocol for this interface will be RTP (Real-time Transport Protocol), a standard internet protocol (typically carried over UDP) that is specifically designed for transmission of real-time data.

The format of the audio input data received by the ASR component for the first DICIT prototype

will be as follows:

• single channel audio

• 16-bit PCM

• 16 kHz sampling rate

This format is compatible with both the front-end output from the Multichannel Acoustic

Processing subsystem and with the ASR input to the ViaVoice engine.
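For illustration, a minimal RTP sender for this format could look as follows; the 12-byte header layout follows RFC 3550, while the dynamic payload type 96 and the 20 ms packet size are assumptions, not values specified by the DICIT design.

import socket
import struct

RTP_VERSION = 2
PAYLOAD_TYPE = 96          # assumed dynamic payload type for 16-bit/16 kHz mono
SAMPLES_PER_PACKET = 320   # 20 ms at 16 kHz (assumed packetization)

def send_pcm_stream(pcm_chunks, host, port, ssrc=0x44494349):
    # pcm_chunks yields byte strings of 2 * SAMPLES_PER_PACKET bytes each
    # (16-bit samples). Each chunk is prefixed with a minimal RTP header:
    # V/P/X/CC byte, M/PT byte, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq, timestamp = 0, 0
    for payload in pcm_chunks:
        header = struct.pack("!BBHII", RTP_VERSION << 6, PAYLOAD_TYPE,
                             seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
        sock.sendto(header + payload, (host, port))
        seq += 1
        timestamp += SAMPLES_PER_PACKET   # RTP timestamps count samples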

2.3.2. ASR Engine

The ASR component in DICIT will be based on IBM’s Embedded ViaVoice (EVV) engine [2], an

HMM-based recognizer which supports both PC and embedded hardware platforms. The CIMA

framework provides built-in integration with the EVV engine, allowing recognized words to be

acted upon in various ways, depending on the vocabularies and actions defined within CIMA.

The EVV engine is a speaker-independent ASR engine which supports both small and large

vocabularies. A vocabulary can be either a finite state grammar (FSG) with a structured syntax, or

a statistical language model which allows free-form natural speech input.


For the DICIT STB prototypes, statistical language models will be used for the ASR vocabulary in

order to allow free-form voice input by users.

2.3.3. ASR Acoustic Models

The Embedded ViaVoice ASR engine supports acoustic models which can be trained or adapted to

particular acoustic environments. For DICIT, this allows the models to be tuned specifically for the audio that will be generated by the DICIT Multichannel Acoustic Processing subsystem.

Adaptation of the EVV acoustic models is accomplished through collection of representative audio

adaptation data (typically several hours of recordings) and then adapting a base model to this new

set of data. The adaptation process is done using IBM’s acoustic model adaptation tools, which can

take a pre-recorded set of audio data as input, and generate a new adapted acoustic model as output.

For the first DICIT prototype, the audio adaptation data will include contaminated speech data

derived from a clean speech corpus. This will provide a larger amount of adaptation data than

would otherwise be available.

Since the first DICIT prototype is expected to support three spoken languages (English, Italian,

German), the adaptation audio and acoustic model adaptation process will be done in all three

languages, and three different acoustic models will be generated.

2.3.4. ASR Language Models

The EVV ASR engine also supports statistical language models which are a key component of the

natural language understanding (NLU) capabilities of the CIMA framework. The statistical

language models allow free-form ASR which is not constrained according to pre-defined grammars.

This enhances the usability of the DICIT system by allowing natural input by voice, without

requiring users to memorize a fixed syntax for commands which are spoken.

Similar to the EVV acoustic models, the EVV ASR language models are trained to a particular

environment and application. This is accomplished by providing a set of representative language

training data in textual form, which is processed using IBM’s language model training tools.

As with the acoustic model training in the DICIT prototype, the language model training will be

done on three sets of language training data in order to support the expected languages (English,

Italian and German).

2.3.5. NLU Engine

The natural language understanding component in the DICIT prototype will use the NLU engine

which is part of the IBM CIMA framework. This engine is based on Action Classification

technology, which is able to map arbitrary recognized phrases from the ASR engine into categories

of actions which are appropriate for each phrase.

2.3.6. NLU Models

The models used by the CIMA NLU engine are statistically based and are trained for a particular

application and domain of interaction. Similar to the ASR language models, the NLU models are


trained on a set of representative text training data. This text training data is first annotated (using

an IBM text annotation tool) in order to provide the classification information for words in the text.

Once the training data is annotated then an NLU model can be generated from that data using an

IBM model generation tool.

Since the NLU models are language dependent, it is expected there will be three sets of models

generated for the DICIT prototypes in order to support the planned languages (English, Italian,

German). However, for the first DICIT prototype at month 18, it may be the case that there is

insufficient language data for training accurate Italian and German NLU models. In that case, those

languages may use a grammar-based ASR system for the first prototype instead of a statistical NLU

system, while the English first prototype will use NLU.

2.3.7. Dialog Manager

The dialog manager component in the DICIT prototype will be based on the CIMA framework

dialog manager, which manages all interactions with user input/output (including voice, haptic and

visual) and manages the interfaces to external data and devices (such as the Electronic Program

Guide (EPG) database and the TV and handheld remote devices).

2.3.8. CIMA Internal Architecture

Figure 6: CIMA Internal Architecture

Figure 6 shows the internal components of the CIMA framework. The Interaction Manager consists

of an Event Manager, Session Manager, Strategy Manager and Action Manager. The interaction

between the four managers is defined as one event loop of the Interaction Manager.

• The Event Manager creates CIMA events and manages the event queue.


• The Session Manager manages session contexts. It is responsible for finding a correct session

context for the current event. The session context contains all data of the strategy. The current

embedded CIMA implementation contains one session.

• The Strategy Manager is responsible for creating, maintaining, and destroying strategies. The

strategies can be understood as a set of rules for managing the conversation. The Strategy

Manager contains a Policy that decides which strategy should be used for handling the event. As

a result of the handling, the Strategy outputs an Action Request List.

• The Action Manager creates, maintains, and destroys actions. The manager starts and stops

appropriate actions based on the Action Request List. If the device generates an event, the

Action Manager may call the Event Manager to create and fire a new CIMA event.

The CIMA components use the shared Data Model for information interchange.
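The following Python sketch mirrors this event loop conceptually (queue, session lookup, policy-selected strategy, action requests that may fire new events); the class and callback names are illustrative and do not reflect CIMA's actual API.

from collections import deque

class InteractionManager:
    def __init__(self, policy, actions, data_model):
        self.events = deque()         # Event Manager: the event queue
        self.session = {}             # Session Manager: one embedded session
        self.policy = policy          # Policy: chooses a strategy per event
        self.actions = actions        # Action Manager: name -> callable
        self.data_model = data_model  # shared Data Model for interchange

    def fire(self, event):
        self.events.append(event)

    def run_once(self):
        # One event loop: dispatch, handle via a strategy, execute actions.
        event = self.events.popleft()
        strategy = self.policy(event)
        requests = strategy(event, self.session, self.data_model)
        for name, args in requests:               # the Action Request List
            for new_event in self.actions[name](args) or []:
                self.fire(new_event)              # actions may fire new events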

Strategies

A Strategy is an algorithm that analyzes an input Event, and produces an Action Request List.

Generally, strategies can be implemented without any limitations on the programming language.

CIMA applications are implemented as state machines defining the dialog flow in the SCXML (State Chart XML) format, using ECMAScript code to perform executive calls.

Interfacing with Devices

The CIMA framework can interface with external devices via the action interface. The action

interface specifies an ECMA data structure for controlling the device and for receiving events from

the device. The driver for the external device then implements the action interface. CIMA has built-

in action interfaces for ASR, NLU and TTS engines.

2.3.9. TTS Engine

The text-to-speech (TTS) component in the DICIT prototype will use the IBM Embedded TTS

engine which is part of the CIMA framework. This TTS engine is based on concatenative speech

technology and supports both PC and embedded hardware platforms. The engine is available in

several different languages, including the three languages planned for the first DICIT prototype

(English, Italian, German). For each language, the engine supports different male and female

voices, allowing for some degree of customization for a particular application or listener.

The audio output from the TTS component in DICIT is routed to both the sound system of the TV

(for prompt playback to the user) as well as to the input of the Multichannel Acoustic Processing

subsystem (for use by the Two-channel Acoustic Echo Cancellation).


2.4. Content Management Subsystem

The Content Management Subsystem will mainly provide storage for two kinds of semi-permanent

data sets: the Electronic Program Guide (EPG) and the User Profiles.

The EPG will be downloaded from the Internet according to the XMLTV [3] syntax. The system needs to have Internet access, but not necessarily continuously. If no up-to-date EPG can be found, the DICIT functions related to the EPG will not be available (empty EPG).

Due to the size of the EPG data set, the storage will be based upon a Relational Database Management System (RDBMS), such as the open-source projects MySQL or PostgreSQL.

The User Profiles are intended to store various user preferences such as favorite TV program,

preferred TV channels for news and entertainment, and other custom settings.

The User Profile system is smaller and simpler than the EPG system and does not necessarily need to be implemented using an RDBMS. However, for the sake of uniformity, the specification for both these subsystems will be given in the form of an Entity-Relationship Information Model.

2.4.1. Electronic Program Guide

For the EPG data set, the RDBMS will act as the intermediary (trait d'union) between the "producers" of the EPG data, external to the DICIT system (e.g. servers on the Internet), and the "consumer" of such data, which is the DICIT dialog system.

The dialog system will use such data in a number of cases, the most important of which are:

1. the construction of a "permanent" binding between the list of channels known to the STB (accessed via their number) and the list of channels mentioned within the EPG

2. the construction of program lists to be shown in the EPG listings

3. the construction of dynamic grammars for the speech recognition in the various cases of search (e.g. by channel name, by actor, ...)

The EPG will be periodically downloaded from the Internet according to the XMLTV syntax.

As an example update policy, the refresh of the EPG could be done once a week and should download the EPG data for the subsequent week. In this case, at any time the DBMS would store the EPG data for two weeks: the current one and the next one; a rough estimate of the size of such a data set is in the range of tens of megabytes for a reasonably large EPG.

If the CMS detects that its data are not up to date, for example because the system has been disconnected from the Internet, it will download fresh data as soon as possible.

In the following, the most significant parts of the XMLTV DTD are reported (note that not all elements and attributes are listed, only those most important for the DICIT purposes).

<!ELEMENT tv (channel*, programme*)>
<!ELEMENT channel (display-name+, icon*, url*) >
<!ATTLIST channel id CDATA #REQUIRED >
<!ELEMENT programme (title+, sub-title*, desc*, credits?, date?,
                     category*, language?, orig-language?, length?,
                     icon*, url*, country*, episode-num*, video?, audio?,
                     previously-shown?, premiere?, last-chance?, new?,
                     subtitles*, rating*, star-rating? )>
<!ATTLIST programme start CDATA #REQUIRED
                    stop CDATA #IMPLIED
                    pdc-start CDATA #IMPLIED
                    vps-start CDATA #IMPLIED
                    showview CDATA #IMPLIED
                    videoplus CDATA #IMPLIED
                    channel CDATA #REQUIRED
                    clumpidx CDATA "0/1" >
<!ELEMENT credits (director*, actor*, writer*, adapter*, producer*,
                   presenter*, commentator*, guest* )>
<!ELEMENT director (#PCDATA)>
<!ELEMENT actor (#PCDATA)>
<!ELEMENT writer (#PCDATA)>
<!ELEMENT adapter (#PCDATA)>
<!ELEMENT producer (#PCDATA)>
<!ELEMENT presenter (#PCDATA)>
<!ELEMENT commentator (#PCDATA)>
<!ELEMENT guest (#PCDATA)>

Figure 7: DTD Fragment for XMLTV

The above DTD fragment covers three interrelated pieces of information: Channel, Programme, and Credits. A possible E/R model of the EPG data, extracted from the XMLTV data, follows. The model covers the needs of the dialogs designed for DICIT prototype #1.

Figure 8: E/R model of EPG data
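To make the data flow concrete, here is a small Python sketch that parses an XMLTV file and loads it into a relational store; the table layout is an assumption loosely following the DTD above, not necessarily the E/R model of Figure 8.

import sqlite3
import xml.etree.ElementTree as ET

SCHEMA = """
CREATE TABLE IF NOT EXISTS channels (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS programmes (
    channel_id TEXT, start TEXT, stop TEXT, title TEXT, category TEXT);
CREATE TABLE IF NOT EXISTS credits (
    channel_id TEXT, start TEXT, role TEXT, person TEXT);
"""

def load_xmltv(path, db_path="epg.db"):
    # Parses the XMLTV document and stores channels, programmes and credits.
    db = sqlite3.connect(db_path)
    db.executescript(SCHEMA)
    tv = ET.parse(path).getroot()
    for ch in tv.findall("channel"):
        db.execute("INSERT OR REPLACE INTO channels VALUES (?, ?)",
                   (ch.get("id"), ch.findtext("display-name")))
    for prog in tv.findall("programme"):
        key = (prog.get("channel"), prog.get("start"))
        db.execute("INSERT INTO programmes VALUES (?, ?, ?, ?, ?)",
                   key + (prog.get("stop"), prog.findtext("title"),
                          prog.findtext("category")))
        credits = prog.find("credits")
        for person in credits if credits is not None else []:
            # each child of <credits> is e.g. <actor>, <director>, ...
            db.execute("INSERT INTO credits VALUES (?, ?, ?, ?)",
                       key + (person.tag, person.text))
    db.commit()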

2.4.2. User Profiles

The User Profiles will keep information related to the rights, preferences and tracked behaviour of

the users of the DICIT system. A possible E/R model of the User Profile is as follows. The model

exceeds the needs of the dialogs designed for the first DICIT prototype.


Figure 9: E/R model of User Profile

For DICIT, the CIMA framework will provide built-in support for user profiles. The selection and

alteration of the profile data will be under the control of the CIMA dialog manager. When a user

identifies himself or herself to the system (through a username or PIN), then the profile for that user

will be activated by the dialog manager.

2.4.3. Design considerations

The two data sets will not be related to each other by explicit relationships (indexes), in order to allow them to be updated separately, even if at a slightly higher computational cost.

For example, although the field "channel name" of table "blocked_preferred_programs" clearly has its domain coincident with the field "name" of table "channels", it will not be modelled as a relationship (e.g. stored as an index) but will instead be stored by its full name.

This keeps the User Profile consistent across reloads of the EPG data, when the index of a given channel can change over time.


2.5. TV Control Subsystem

The TV Control Subsystem is responsible for TV program decoding and display, as well as the rendering of Teletext and the on-screen display (OSD) output, which is the visual output of the entire system. It is composed of a Set Top Box (STB) controlled by the Dialog Manager by means of an RS-232 serial connection to PC 2.

Figure 10: TV Control Architecture

2.5.1. STB

The STB component decodes a live free-to-air (not scrambled) TV program from a satellite signal. Programs are normally composed of one video stream and one or more audio streams, as well as Teletext data. The STB is able to reproduce a program and lets the user choose the audio source. It automatically restores the current program in case of loss of satellite signal.

It has an F-connector for SAT input and is able to control multiple dishes by means of DiSEqC signalling. This means that the same STB is able to receive programs from more than one satellite.

The outputs of the STB are 3 RCA connectors: one for composite video output and two for analogue stereo audio. For the first DICIT prototype, digital audio is not available.

The STB also has 2 LEDs for activity signalling: one green and one red. Their behaviour is controlled by the PC.

The STB doesn't really switch off but just disables video output (shows a black page) when not active.

The Electronic Program Guide (EPG) is not handled by the STB because the EPG information embedded in the satellite signal is not always available and not complete.

For the first DICIT prototype the program list is not obtained from the live satellite signal but is embedded in the system. Teletext output is also not available as an OSD page in the first DICIT prototype.


Teletext support itself is fully functional, as in other STB products, via the TV data information; this means that Teletext cannot be controlled by the dialog manager, but only by means of the TV's remote control.

The STB firmware is upgradeable via a serial connection.

2.5.2. Remote Control

The remote control will be a universal model that can be programmed to act as a different remote; it will be programmed to behave as that of a digital satellite receiver. All of its keys are available for use except for the following, which are not transmitted: 1-, 2-, MOV, VIEW, F1, F2, F3, F4, BACK, RAD, LANG, LNB, H/V, 22K.

The software interface from the remote control to PC 2 will be WinLIRC, a standard open-source interface which allows Windows-based PCs to receive signals from infrared remotes. The received signals will be sent by WinLIRC to the CIMA dialog manager, which is also running on PC 2, allowing the dialog manager to respond to keys on the remote in addition to voice commands.
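A minimal client for this interface could look as follows, assuming WinLIRC's default TCP port 8765 and the standard lircd broadcast line format ("<code> <repeat> <button> <remote>"); both are configuration-dependent.

import socket

def winlirc_buttons(host="127.0.0.1", port=8765):
    # Connects to WinLIRC's TCP socket and yields pressed button names,
    # which could then be forwarded to the dialog manager as events.
    sock = socket.create_connection((host, port))
    buf = b""
    while True:
        buf += sock.recv(1024)
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            fields = line.decode("ascii", "replace").split()
            if len(fields) == 4:
                code, repeat, button, remote = fields
                yield button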

Figure 11: Remote Control for the first DICIT prototype

2.5.3. STB Interface API

The STB interface consists of a DLL for Windows that communicates with the STB via RS-232. The serial communication speed is 115200 baud.

The DLL is responsible for correct communication with the STB. It translates each API command into a sequence of commands compatible with the STB firmware.

Variable memory allocation is not performed by the library and must be done by the calling program.

The list of exported functions is detailed in Appendix B: STB Interface API Functions.
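As a hedged illustration of how a host program on PC 2 might drive such an interface, the ctypes sketch below loads the DLL and calls it with a caller-allocated result buffer; the file name "stb_api.dll" and the functions STB_Open, STB_SetChannel and STB_Close are hypothetical placeholders, not the real exported names listed in Appendix B.

import ctypes

def tune_channel(number, port=b"COM1", baud=115200):
    # Hypothetical usage: open the RS-232 link at 115200 baud (Section 2.5.3),
    # send one channel-change command, and close the link. The library does
    # no allocation, so the status buffer is owned by the caller.
    stb = ctypes.WinDLL("stb_api.dll")          # hypothetical DLL name
    handle = stb.STB_Open(port, baud)           # hypothetical function
    if handle < 0:
        raise IOError("could not open serial connection to STB")
    status = ctypes.c_int(0)                    # caller-allocated buffer
    stb.STB_SetChannel(handle, number, ctypes.byref(status))
    stb.STB_Close(handle)
    return status.value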


3. Tools and Standards for the STB prototype

3.1. Software Standards

This section describes some of the software standards which will be a key part of the DICIT

architecture.

3.1.1. SCXML

State Chart XML (SCXML) is an emerging W3C standard for expressing state machines [4], and is

becoming more common as a language for specifying dialog control flow in speech dialog

managers. It is an XML-based declarative language which supports specification of both simple

and complex state machines.

In the DICIT prototypes, SCXML will be used as the dialog manager control language for the

CIMA dialog manager. The IBM CIMA Development Tool (described in Section 3.3) will produce

SCXML output from authored visual dialog diagrams, and the generated SCXML will then be

interpreted by the CIMA dialog manager during execution of the DICIT prototypes.

Figure 12 shows a sample state diagram (for a coffee maker in this case) containing several states

and transitions between states. This visual state diagram would typically be authored by a state

machine developer (or a voice dialog developer in the case of voice dialogs).

Figure 12: Sample SCXML State Diagram

Figure 13 below shows the actual SCXML code listing for the state diagram in Figure 12. Without

automatic conversion tools, a state machine author or dialog author would typically need to write

the SCXML code manually, in order to match the states and transitions represented in the state

diagram. However, through the use of SCXML development tools (such as the IBM CIMA


Development Tool or EB GUIDE Studio), the SCXML code can be automatically generated from the visual diagram, thereby greatly improving the ease and speed of developing SCXML-based applications.

Figure 13: Sample SCXML code listing
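Since the listing in Figure 13 is reproduced only as an image, the small Python sketch below illustrates the semantics such a generated SCXML document encodes, namely named states with event-driven transitions; the coffee-maker state and event names are illustrative, not taken from the figure.

# A dictionary-based rendering of state-chart semantics: (state, event) pairs
# map to successor states, just as SCXML <transition> elements do.
TRANSITIONS = {
    ("idle", "power_on"): "heating",
    ("heating", "temp_reached"): "ready",
    ("ready", "brew_pressed"): "brewing",
    ("brewing", "carafe_full"): "ready",
    ("ready", "power_off"): "idle",
}

def run(events, state="idle"):
    for event in events:
        state = TRANSITIONS.get((state, event), state)  # ignore unknown events
    return state

print(run(["power_on", "temp_reached", "brew_pressed", "carafe_full"]))  # ready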

3.2. EB GUIDE Studio

EB GUIDE Studio is an authoring tool for the detailed specification of multimodal dialogs.


Multimodal HMI designs for both graphical/haptic and speech dialog systems (SDS) can be specified in parallel in one tool, thanks to an integrated solution with intelligent editors.

[Figure: EB GUIDE Studio editor overview — state chart editor (UML state charts, including pop-up handling), speech editor (multimodal HMI and grammar editing: prompts and commands), dialog editor, view editor, language editor, dialog simulator and I18N export; grammar, dialog flow, views, widgets, menu logic and texts are stored in an XML database.]


The appearance of all editors is user-configurable.

Automated consistency checks through checker plugins ensure that properties are consistent within and across modalities and formally complete. It is also possible to reuse dialog parts specified in other projects by using the project merge functions.

The specification conforms to UML state charts. The different editors also support a general modeling style by offering templates, (deep) history states, conditions, enter and exit actions, transition specifications, and, for speech, command and prompt states as well as SRGS grammars (which can easily be translated to VoiceXML or JSGF).

It is also possible to generate multiple displays and to drive multimodal support with multiple state machines, e.g. for EPG, remote control and speech input. The multiple and hierarchical state machine support also includes a synchronization function to synchronize the different displays.

An open API for independent development of specialized plugins, checkers or other functions allows project-dependent extras such as a Wizard-of-Oz environment. All data (dialog flows as well as images, prompts and commands) is stored in a central XML repository.

Apart from the specification of multimodal dialogs, the tool also makes it possible to simulate the entire taskflow for demonstration and usability checks, and to change properties, values or the taskflow on the fly if the look, hear and feel of the specified dialog system needs to be modified. Simulation interfaces to real hardware are available via TCP, USB/RS232 or LVDS, e.g. to connect the STB directly to the specified dialog for testing. For speech input and output it is possible to use one of the supported state-of-the-art recognizers, such as IBM Embedded ViaVoice.

In EB GUIDE Studio it is also possible to export the specified dialog into other formats, e.g. SCXML (which is supported by CIMA), or to generate code from the specification for direct use on the target, e.g. the HMI target in the automotive domain.



3.2.1. SCXML Export

Using the SCXML Export plugin, it is possible to export a GUIDE project to SCXML with CIMA

extensions. That way, it is possible to create and test a project in EB GUIDE Studio, create an

export, and then run it with the IBM CIMA dialog manager.

While the GUIDE state-chart model and SCXML/CIMA are quite similar, there are still some differences. Some features of GUIDE are not supported by SCXML (e.g. deep history states) or by CIMA (e.g. history states) and therefore cannot be used. Checkers can detect uses of these features and issue a warning.

Moreover, an extension was added to GUIDE that allows creating more than one event for one grammar. In regular GUIDE, only one event can be created for a grammar. In CIMA projects, however, one grammar creates different events (or events plus different conditions) depending on the result of the grammar parse. Events for different return values can be specified with the SpeechEvents plugin.

3.2.2. WOZ Plugins

For the WOZ recordings, a special set of plugins was developed. These plugins were used to enable the wizard to remotely control the system during the recordings, to interact with the user by means of TTS, and to record the session to a log file. For the WOZ recordings, an older version of GUIDE (2.8) was used. So far, the WOZ plugins are not available for EB GUIDE Studio and would need to be ported.

• The State Explorer plugin shows a list of states, and the wizard can jump to these states. Since two parallel state machines (one haptic and one speech) were used, the states are changed in both state machines.

• In the SD Command Simulation Frame plugin, the wizard can see the list of active prompts and can trigger them with a double-click.

• The TTS Simulation Frame plugin offers a pre-defined list of TTS prompts, which are stored in an XML file. The wizard can trigger these or can enter a TTS prompt manually.

• The Simulation Logger plugin creates a log file from the current session. For every recording, a new log file can be created. These log files can then be used for the evaluation of the system.

• Using the Control Panel plugin, the wizard can press the buttons of the remote control (except for the numbers).

• The Recognizing plugin provides a button and a shortcut (F12) that can be used to tell the user that the system recognized an input and is currently processing it. A TTS prompt is generated and a small icon is shown on the screen for 3 seconds.

• The EPG plugin provides EPG functionality for the DICIT model. The back-end is a movie database and the front-end a frame, which can be controlled (e.g. to start a query) either by the wizard or by means of events.


3.3. IBM CIMA Development Tool

The IBM CIMA Development Tool is a dialog authoring tool for the IBM CIMA (Conversational

Interaction Manager Architecture) dialog manager. It allows speech dialog authors to create speech

applications using a visual dialog flow diagram containing dialog states and transitions, as shown in

Figure 14 below. These dialog flows can be transformed by the tool into SCXML markup

language, which can then be run on the IBM CIMA dialog manager runtime engine.

Figure 14: IBM CIMA Development Tool

The IBM CIMA Development Tool is an IDE (integrated development environment) based on the

Eclipse open development platform. The tool supports attaching of necessary resources (grammars,

prompts, multimodal content, etc.) to a speech application. It also supports creation of scripts

within various states of a speech application, in order to connect the application to external data and

devices.

A key part of the CIMA development environment is the CIMA templates, which contain pre-defined scripts for commonly used functions. These templates allow more rapid development of speech applications by eliminating the need for dialog authors to write these common functions from scratch.


4. DICIT First Surveillance Prototype System Architecture

The DICIT architecture for the first surveillance prototype consists of a central unit which communicates with intrusion detection sensors placed in the environment, with output devices such as sirens, and with control devices that allow the user to configure and control the system (e.g. keypads, card readers). Several types of sensors can be used, either stand-alone or in a cooperative configuration: for instance infrared volumetric sensors, double-technology sensors (IR and microwave), magnetic contacts, and vibration sensors.

The novelty for surveillance is a new intrusion detection sensor based on Acoustic Event Detection (AED), implemented on a PC that will be connected to the existing Fracarro surveillance system via a serial communication line. This allows the system to combine the alarm information provided by the sensors with acoustic events, and can reduce the occurrence of false alarms.

The figure below shows the architecture of the first DICIT surveillance prototype.

Figure 15: Device setup for the first DICIT surveillance prototype

The AED monitors the environment acoustically by means of a distributed microphone network

placed in the house’s rooms. The AED system communicates to the central unit any detected


acoustic event, along with its time and type information. The central unit combines the information coming from the AED and the other sensors to decide whether to activate or deactivate the alarm.
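
To illustrate the kind of combination logic involved, the following minimal C sketch shows how a sensor alarm might be confirmed by a recent acoustic event before the siren is raised. All type names, event types and the confirmation window are hypothetical; the actual decision logic of the central unit is defined by Fracarro.

/* Illustrative sketch only: hypothetical types and thresholds, not the
 * actual central-unit firmware. A conventional sensor alarm is confirmed
 * by a recent AED event before the alarm is raised. */
#include <stdbool.h>
#include <time.h>

typedef enum { AED_NONE, AED_GLASS_BREAK, AED_DOOR_SLAM, AED_SCREAM } aed_type_t;

typedef struct {
    aed_type_t type;   /* classified event type */
    time_t     when;   /* time stamp reported by the AED PC */
} aed_event_t;

#define AED_CONFIRM_WINDOW_S 30  /* assumed confirmation window, in seconds */

/* Raise the alarm only if a sensor tripped AND the AED reported an
 * acoustic event within the confirmation window. */
bool confirm_alarm(bool sensor_tripped, const aed_event_t *last_aed, time_t now)
{
    if (!sensor_tripped)
        return false;
    if (last_aed == NULL || last_aed->type == AED_NONE)
        return false;  /* no acoustic confirmation: possible false alarm */
    return difftime(now, last_aed->when) <= AED_CONFIRM_WINDOW_S;
}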

4.1. Central Unit

The surveillance system will be composed of 4 base lines (sensors), but more lines can be connected to the central unit via a module communicating with it through a serial link. The lines are configurable to generate different kinds of alarms based on the sensor information: instantaneous alarm, delayed alarm, emergency, anti-robbery, technological, medical alarm, tamper alarm, and key (used to control the system). The system lines can be grouped into 8 different partitions, which may overlap; this makes it easy to activate partial protection of the system. For example, a night partition can consist of the sensors on windows and external doors while excluding the volumetric indoor sensors.

Different code types are provided: 1 code for the system installer, 1 master code, and 8 user codes with 2 priority levels, each associated with some of the available partitions.

The user can activate or deactivate one or more partitions through a menu-based interface on the keypad or through the key reader. Each key is associated with a user and allows activation of different partitions by means of partially predefined insertions of the key.

A serial bus is available that can connect up to 16 devices; remote keypads and an electronic key reader can be connected to this RS-485 bus.

The system has a memory which can store up to 100 events, including the temporal information (date and time).
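
As an illustration of such an event memory, the following C sketch implements a 100-entry circular log with date and time. The event codes and layout are assumptions, not the central unit's actual internal format; the oldest entries are overwritten first.

/* Minimal sketch, assuming hypothetical event codes; keeps the last
 * 100 events together with their date and time. */
#include <time.h>

#define EVENT_LOG_SIZE 100

typedef struct {
    int    code;   /* e.g. alarm type or key/partition action */
    time_t stamp;  /* date and time of the event */
} log_entry_t;

static log_entry_t event_log[EVENT_LOG_SIZE];
static int log_next = 0;   /* index of the slot to overwrite next */

void log_event(int code)
{
    event_log[log_next].code  = code;
    event_log[log_next].stamp = time(NULL);
    log_next = (log_next + 1) % EVENT_LOG_SIZE;  /* wrap around after 100 entries */
}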

A telephone dialer can be connected to the system, with an optional speech module. In that case it provides 8 communication channels (which can be configured to send vocal alarms or digital communications to surveillance centers), with a list of 32 telephone numbers that can be associated with the communication channels. The speech module allows the user to be contacted by phone when an alarm has been detected, with a vocal message sent according to the type of the alarm (anti-robbery, anti-intrusion, technological, emergency). It is also possible to personalize the messages, the telephone numbers to be called, and the number of call repetitions.

4.2. Acoustic Event Detection and Classification

The AED first processes the signals acquired by a distributed microphone network through the multichannel processing module. The acoustic event detection block then extracts from the continuous audio stream those segments characterized by a sound event and sends them to the acoustic event classifier, whose aim is to identify the event type according to a predefined list of possible events. Finally, the information about the classified event is sent to the central unit. In the first surveillance prototype, the AED will run on a PC which communicates with the central unit through an asynchronous RS-485 serial line at a speed of 9600 baud.
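
The following C sketch illustrates the AED-PC side of such a link on a POSIX (e.g. Linux) system, assuming the RS-485 adapter appears as a serial tty device and assuming a simple one-line ASCII frame; the actual protocol between the AED PC and the central unit is not specified here.

/* Sketch only: device path and frame format are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <time.h>
#include <unistd.h>

int open_aed_link(const char *dev)   /* e.g. "/dev/ttyUSB0" (assumed) */
{
    int fd = open(dev, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) < 0) { close(fd); return -1; }
    cfmakeraw(&tio);                 /* raw 8N1 byte stream */
    tio.c_cflag |= CLOCAL | CREAD;
    cfsetispeed(&tio, B9600);        /* 9600 baud, as in the prototype */
    cfsetospeed(&tio, B9600);
    if (tcsetattr(fd, TCSANOW, &tio) < 0) { close(fd); return -1; }
    return fd;
}

/* Report a classified event, e.g. send_event(fd, "GLASS_BREAK", t). */
int send_event(int fd, const char *event_type, time_t when)
{
    char frame[64];
    int n = snprintf(frame, sizeof frame, "EVT %s %ld\n",
                     event_type, (long)when);
    return (int)write(fd, frame, (size_t)n);
}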


Figure 16: Block diagram for the AED (Distributed Microphone Network → Multichannel Processing → Acoustic Event Detection → Acoustic Event Classification → to the Central Unit)


5. Appendix A: Prototype Hardware Requirements

The DICIT hardware for the first prototype can be divided into three main blocks: PC 1, PC 2, and

external hardware. Each of these blocks is described in further detail in the sections below.

5.1. PC 1

PC 1 hosts the Multichannel Acoustic Processing subsystem and the audio interface. The PC should be a powerful machine equipped with a multi-core CPU, suitable for parallelizing the execution of the different software modules.

A suggested hardware configuration could be the following:

• Intel Quad Core Xeon E5320

• 4GB RAM

• 2 x 320GB SATA HD

An important part of PC 1 is the audio data recording hardware. The digital board could be the RME HDSP 9652, connected via ADAT optical links to three RME Octamic D acquisition boards. An external output board connected via ADAT is required to play the de-correlated TV signals; a suggested board is the MOTU 896HD. Figure 17 shows the hardware configuration of PC 1. The number of channels to be recorded is 15 (from the nested array) + 2 (TV left and right channels) + 1 (synthesis), i.e. 18 in total, well within the 24 inputs provided by the three 8-channel Octamic boards. Later in the project, the TV audio could become a 5 + 1 surround signal.

Figure 17: PC 1 audio acquisition chain

• The RME HDSP 9652 is a digital acquisition board that offers 3 ADAT optical I/O ports, ADAT-sync in, SPDIF I/O, and word clock I/O. It is a PCI board supported by the ALSA drivers.

• The RME Octamic D is an acquisition board with 8 balanced XLR mic/line inputs. Each channel has switches for 48 V phantom power, a low-cut filter and phase reversal. The amplification gain can be set between 10 and 60 dB. The ADC module adds 8 channels


of pristine 192 kHz conversion at 24-bit precision, available as double ADAT output (S/MUX, up to 96 kHz) and simultaneously via DB-25 connectors as 4 AES outputs (up to 192 kHz). The ADC can be clocked internally (master) or externally via word clock and AES sync.

• The MOTU 896HD is an audio interface with 8 microphone preamplifiers, pristine 192 kHz analog I/O, 8 channels of ADAT digital I/O, and stereo AES/EBU, all at 24-bit precision.

5.2. PC 2

PC 2 hosts the Speech & Dialog Management subsystem, the Content Management subsystem and the PC software interfaces for the TV Control subsystem. This PC does not require as much CPU, memory or disk storage as PC 1; a typical desktop configuration is sufficient:

• single-core CPU 1.8 GHz or above

• 1 GB RAM or above

• 75 GB HD minimum

PC 2 also requires 2 RS-232 serial ports for communication with the external devices of the TV Control subsystem. One serial port will be used for a control connection to the external Set Top Box, allowing PC 2 to transmit API commands (listed in Appendix B) to the box. The other serial port will be used for a connection to an IR receiver, allowing PC 2 to receive input commands from the handheld Remote Control.

5.3. External hardware

Some hardware needs to be connected to the PCs of the prototype.

• Microphone Array:

On the input side, connected to the three Octamic boards, there will be an array of

microphones. This array will be composed of 15 low-cost electret microphones in a nested configuration. Figure 18 shows the layout of the array (all distances shown are in centimeters).

Figure 18: The nested microphone array for audio acquisition

• LCD television:


Connected to the PC there will be an LCD television. The size of the display should be chosen relative to the room, but at least a 32” TV is recommended. A suggested model is the 46” Sharp Aquos.

• Set Top Box:

A modified version of a commercial STB, made available by Fracarro, will be connected to a satellite dish to receive free-to-air DVB-S digital programs. The STB is based on the ST5105 platform.

• Remote Control:

The infrared remote control will be a programmable universal model, the “Planet Alias 1”. It will be programmed by Fracarro as a digital satellite receiver remote, compatible with the STB system.

• Surround system:

The TV and PC 2 will be connected to a 5 + 1 audio surround system. For the first prototype only two stereo channels will be used. A suggested system is the Genelec 8020A speakers plus the Genelec 7050B subwoofer.


6. Appendix B: STB Interface API Functions

Functions for Program Discovery and Selection

• GetChannelScanStatus (ScanStatusEnum_t *status, WORD *ProgramFound)

Get the STB Program Scan Status.

• GetCurrentProgram (WORD *program_no)

Retrieve the Program the STB is showing.

• GetProgramAudio (BYTE *Index)

Get the audio index for the current program.

• GetSatParameter (BYTE SatNo, SATELLITE_STRUCT *SatConfig)

Get the satellite configuration parameters.

• ProgramScan (SATELLITE_STRUCT *Satellite, BYTE Mode, BOOL Blocking)

Perform an STB Program Scan.

• QueryProgramList (WORD *program_no, PROGRAM_STRUCT *programs)

Get the Program list of the STB.

• SetCurrentProgram (WORD program_no)

Set the STB Program.

• SetProgramAudio (BYTE Index)

Set the audio index for the current program.

• SetSatParameter (BYTE SatNo, SATELLITE_STRUCT *SatConfig)

Set the parameters for the satellite configuration.

• StopProgramScan ()

Stop a non-blocking STB Program Scan.
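
A hypothetical usage sketch of the functions above is given below. It assumes a header providing the WORD/BYTE typedefs and the structures used in the signatures, and assumes 1-based program numbers; only the signatures are specified in this appendix.

/* Sketch only: header name and program numbering are assumptions. */
#include "stb_api.h"   /* assumed header declaring the STB interface API */

/* Step the STB from the currently shown program to the next one. */
void tune_next_program(void)
{
    WORD current = 0;
    GetCurrentProgram(&current);             /* which program is showing now? */
    SetCurrentProgram((WORD)(current + 1));  /* tune to the following program */
}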

General Purpose Functions

• GetSTBPower (BYTE *OnOff)

Get the Power state of the STB.

• GetSTBStatus (StatusEnum_t *status)

Get the STB Status.

• GetSTBSWVersion (CHAR *SWVer)

Get the SW version of the STB.

• RestartSTB ()

Perform a software reset for the STB.

• SetBlankPage (BOOL OnOff, DWORD Color)

Set a blank or colored screen.

• SetRadioPage (BOOL OnOff)

Set a radio image on the screen.

• SetSTBPower (BYTE OnOff)

Set the Power state of the STB.

• SetSTBSerialPort (CHAR *SerialPort)

Set the PC serial port used to communicate with the STB.

Functions for STB Audio Volume Control

• GetSTBAudioMute (BOOL *OnOff)

Page 34: Deliverable 2.1 DICIT Architecture Tools, Standards ...dicit.fbk.eu/DICIT_D2.1_V1.0_20071005_PU.pdf · 10/5/2007  · beamforming is ensured. The decorrelation is performed by the

D2.1 – DICIT Architecture

DICIT_D2.1_20071005 30

Get the mute state.

• GetSTBAudioVolume (BYTE *Volume, BYTE *Min, BYTE *Max)

Get the STB Audio Volume and volume limits.

• SetSTBAudioMute (BOOL OnOff)

Set the mute state.

• SetSTBAudioVolume (BYTE Volume)

Set the STB Audio Volume.

Functions for STB OSD

• OSDClearScreen ()

Clear the OSD screen and show video output.

• OSDRefresh ()

Refresh the OSD screen after a change of strings or layout.

• OSDSetLayout (BYTE Layout)

Set the OSD layout.

• OSDSetText (ROW_STRUCT *OSDText, BYTE Struct_no)

Set the text for the main OSD Area.

• OSDShowBitmap (BYTE Bitmap_no, WORD X, WORD Y, BYTE Transparence)

Show a Bitmap.

• OSDShowPopup (BYTE Rows, ROW_STRUCT *PopupText, BYTE Struct_no, WORD Timeout)

Show Popup Message.

• OSDClearPopup ()

Hide a popup message.

• OSDShowScrollbar (BYTE Value, BYTE Min, BYTE Max)

Show the scrollbar.

• OSDShowVolumeGauge (BYTE Volume, BYTE Min, BYTE Max, WORD Timeout)

Show the volume gauge for the specified time.
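
As an illustration, the following hypothetical C sketch combines some of the OSD functions above to give the user visual feedback; the struct index, value range and timeout units (assumed milliseconds) are guesses, since only the signatures are given here.

/* Sketch only: header name, ranges and timeout units are assumptions. */
#include "stb_api.h"   /* assumed header declaring the STB interface API */

/* Show the volume gauge after a voice volume command. */
void show_volume_feedback(BYTE volume)
{
    OSDShowVolumeGauge(volume, 0, 100, 2000);  /* assumed 0..100 range, ~2 s */
}

/* Pop up a short confirmation message and let it time out. */
void confirm_command(ROW_STRUCT *text, BYTE rows)
{
    OSDShowPopup(rows, text, 0, 2000);  /* assumed: struct index 0, ~2 s timeout */
}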


7. References

[1] DICIT – Distant-talking Interfaces for Control of Interactive TV, Annex I – Description of Work, 31-May-2006.

[2] IBM Embedded ViaVoice Developer's Guide, IBM Corporation, June 2007.

[3] XMLTV. Homepage: http://xmltv.org, project page: http://sourceforge.net/projects/xmltv/

[4] State Chart XML (SCXML), W3C Working Draft, http://www.w3.org/TR/scxml/