Acoustic Databases
Jan Odijk
ELSNET Summer School, Prague, 2001
Acknowledgements
Some of the slides have been borrowed from or are based on work by
• Bart D’Hoore
• Hugo van Hamme
• Robrecht Comeyne
• Dirk van Compernolle
• Bert van Coile
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Linguistic Resources (LRs)
Linguistic Resources are sets of language data in machine readable form that can be used for developing, improving or evaluating language and speech technologies.
Some language and speech technologies
• Text-To-Speech (TTS)
• Automatic Speech Recognition (ASR)
• Dictation
• Speaker Verification/Recognition
• Spoken Dialogue
• Audio Mining
• Machine Translation
• Intelligent Content Management
• …
Linguistic Resources (LRs): Major Types
Electronic Text Corpora
• Newspapers, magazines, etc.
• Usenet texts, e-mail, correspondence
• Etc.
Lexical Resources
• Monolingual lexicons
• Translation lexicons
• Thesauri
• …
Acoustic Resources
• Annotated speech recordings
• Annotated recordings of other acoustic signals
  • Coughing, throat clearing, breathing, …
  • Door slamming, screeching tires (of a car), …
Types of Linguistic Resources: Acoustic Resources

Acoustic Databases (ADBs)
• Controlled recordings of human speech or other acoustic signals
• Enriched with annotations
• Recorded digitally
• Representative of the targeted application environment and medium
• Balanced for phonemes/phoneme combinations
• Speaker parameters, recording quality, and environment/medium documented
Types of Linguistic Resources: Acoustic Resources

Annotated unstructured recordings
• Broadcast material
• Recorded conversations/monologues/speeches, etc.
• Dictated material
• Enriched with annotations
Types of Linguistic Resources: Acoustic Resources

In-service data
• Recorded sessions of humans interacting with the running application
• Usually collected by logging a customer system
• Enriched with annotations
• Used for tuning models, grammars, etc. to a specific application
Types of Linguistic Resources: Acoustic Resources

Environments
• “Quiet”
  • Studio
  • Quiet office
  • Normal office
• Noisy
  • Public place (street, hotel lobby, station, etc.)
  • Car (engine running at 0 km/h, city, highway)
  • Industrial environment
Types of Linguistic Resources: Acoustic Resources

Media
• HQ close-talk microphone
• Desktop microphones
• Telephone
  • Analog or digital
  • Fixed line or mobile
• Wide-band microphones
• Array microphones
• PC/PDA etc. low-quality microphones
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Acoustic Resources: Use
(for speech synthesis modules in TTS systems)
(as acoustic reference material for pronunciation lexicons)
Mainly for speech recognition
Training and test material for research into new recognition engines and engine features
Training and test material for development of acoustic models
Tuning of acoustic models for specific applications
What is speech recognition?
ASR: Automatic speech recognition
Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.
Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
Speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples.
Speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Feature Extraction
Turning the speech signal into something more manageable
• Do analysis once every 10 ms
• Data compression: 220 bytes => 50 bytes => 4 bytes
Sampling of a signal: transforming into a digital form
Extracting relevant parameters from the signal
• Spectral information, energy, pitch, ...

Eliminate undesirable elements (normalization)
• Noise
• Channel properties
• Speaker properties (gender)
Feature Extraction: Vectors
The signal is chopped into small pieces (frames), typically 30 ms
Spectral analysis of a speech frame produces a vector representing the signal properties.
=> result = stream of vectors
[Figure: a 30 ms speech frame (waveform) mapped to a feature vector such as (10.3, 1.2, -0.9, …, 0.2)]
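The framing step above can be sketched in a few lines of Python. This is an illustrative toy: log-energy and zero-crossing rate stand in for real spectral features such as MFCCs, and the frame/hop sizes assume 8 kHz audio.

```python
import math

def frame_signal(samples, frame_len=240, hop=80):
    """Chop a sampled signal into overlapping frames.
    At 8 kHz, 240 samples = 30 ms frames with an 80-sample (10 ms) hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Toy feature vector per frame: log-energy and zero-crossing rate.
    (Real recognizers use spectral features such as MFCCs.)"""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return (math.log(energy + 1e-10), zcr)

# One feature vector every 10 ms -> a stream of vectors
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
vectors = [frame_features(f) for f in frame_signal(signal)]
```

Each 30 ms frame is reduced to a handful of numbers, which is exactly the "data compression" step the slide describes.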
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Acoustic Model
Split utterance into basic units, e.g. phonemes
The acoustic model describes the typical spectral shape (or typical vectors) for each unit
For each incoming speech segment, the acoustic model will tell us how well (or how badly) it matches each phoneme
Must cope with pronunciation variability
• Utterances of the same word by the same speaker are never identical
• Differences between speakers
• Identical phonemes sound different in different words

=> statistical techniques: models created from a lot of examples
[Figure: states S1-S13 aligned with the stretched utterance “f-r-ie-n-d-l-y c-o-m-p-u-t-e-r-s”]
Word: series of units specific to the word
Acoustic Model: Units
Phoneme: share units that model the same sound

[Figure: word models for “Stop” (S-T-O-P) and “Start” (S-T-A-R-T) built from shared phoneme-level state sequences]
Acoustic Model: Units
Context dependent phoneme
Example “Stop”: S|,|T  T|S|O  O|T|P  P|O|,  (each phoneme with its left and right context)

Diphone
Example “Stop”: ,S  ST  TO  OP  P,

Other sub-word units: consonant clusters
Example “Stop”: ST  O  P
Acoustic Model: Units
Phonemes
Phonemes in context: spectral properties depend on previous and following phoneme
Diphones
Sub-words: syllables, consonant clusters
Words
Multi words: example: “it is”, “going to”
Combinations of all of the above
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Pattern matching
Acoustic Model: returns a score for each incoming feature vector indicating how well the feature corresponds to the model.
= Local score
Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi)
Search algorithm: looks for the best scoring word or word sequence
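The word-scoring step can be sketched as a small Viterbi search over a left-to-right model; the local scores below are made-up log-domain numbers, not the output of a real acoustic model.

```python
def viterbi(local_scores):
    """Best-scoring state sequence for a stream of frames through a
    left-to-right model (each frame either stays in a state or advances).
    local_scores[t][s]: how well frame t matches state s (log domain)."""
    n_states = len(local_scores[0])
    NEG = float("-inf")
    # must start in the first state
    best = [local_scores[0][s] if s == 0 else NEG for s in range(n_states)]
    backptrs = []
    for t in range(1, len(local_scores)):
        new, ptrs = [], []
        for s in range(n_states):
            cands = [(best[s], s)]                    # stay in state s
            if s > 0:
                cands.append((best[s - 1], s - 1))    # advance from s-1
            score, prev = max(cands)
            new.append(score + local_scores[t][s])
            ptrs.append(prev)
        best = new
        backptrs.append(ptrs)
    path = [n_states - 1]                             # must end in final state
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return best[-1], list(reversed(path))

# 4 frames, 2 states: the best path stays in state 0, then moves to state 1
score, path = viterbi([[0, -9], [0, -1], [-1, 0], [-9, 0]])
```

The returned score is the word's total match against the frame stream; the search algorithm then compares such scores across words or word sequences.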
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Language Model
Describes how words are connected to form a sentence
Limit possible word sequences
Reduce number of recognition errors by eliminating unlikely sequences
Increase speed of recognizer => real time implementations
Language Model
Two major types• Grammar based
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please ;
<no>: no | no thanks | no thank you ;
• Statistical
• Probability of single words, 2/3-word sequences
• Derived from frequencies in a large corpus
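A minimal sketch of the statistical variant, estimating bigram probabilities from a toy corpus (real language models use far larger corpora plus smoothing):

```python
from collections import Counter

def bigram_model(corpus_sentences):
    """Estimate bigram probabilities P(w2 | w1) from a (toy) corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])        # every word that serves as history
        bigrams.update(zip(words, words[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

probs = bigram_model(["yes please", "yes", "no thank you", "no thanks"])
# e.g. "yes" occurs twice, once followed by "please", so P(please | yes) = 1/2
```

During recognition these probabilities are combined with the acoustic scores to prefer likely word sequences.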
Active Vocabulary
Lists the words that can be recognized by the acoustic model and that are allowed to occur given the language model
Each word is associated with a phonetic transcription
• Enumerated, and/or
• Generated by a Grapheme-to-Phoneme (G2P) module
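A lexicon lookup with a naive letter-to-sound fallback can be sketched as follows; the entries and rules here are hypothetical and only a crude stand-in for a real G2P module.

```python
def transcribe(word, lexicon, letter_to_sound):
    """Return phonetic transcriptions for a word: enumerate them from the
    lexicon when present, otherwise fall back to per-letter rules
    (a crude stand-in for a real G2P module)."""
    if word in lexicon:
        return lexicon[word]      # a word may have several pronunciations
    return [[letter_to_sound.get(ch, ch) for ch in word]]

# hypothetical entries in SAMPA-like notation (illustrative only)
lexicon = {"yes": [["j", "E", "s"], ["j", "E", "@"]]}
rules = {"n": "n", "o": "@U"}
```

Usage: `transcribe("no", lexicon, rules)` falls through to the letter-to-sound rules because "no" is not in the lexicon.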
Post Processing
Re-ordering of the N-best list using other criteria, e.g. account numbers, telephone numbers
Spelling: name search from a list of known names
Applying NLP techniques that fall outside the scope of the statistical language model
• E.g. “three dollars fifty cents” => “$ 3.50”
• “doctor Jones” => “Dr. Jones”
• Etc.
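The money-amount rewrite can be sketched as a toy pattern rule; a real post-processor covers many more constructions than this single pattern.

```python
NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
          "ten": 10, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def normalize_money(text):
    """Rewrite '<X> dollars <Y> cents' as '$ X.YY' (toy pattern only)."""
    words = text.split()
    if len(words) == 4 and words[1] == "dollars" and words[3] == "cents":
        return "$ %d.%02d" % (NUMBER[words[0]], NUMBER[words[2]])
    return text          # anything else passes through unchanged
```

For example, `normalize_money("three dollars fifty cents")` yields the display form "$ 3.50".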
Training of Acoustic Models
[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Training of Acoustic Models
Database design
• Coverage of units: word, phoneme, context-dependent unit
• Coverage of population (region, dialect, age, …)
• Coverage of environments (car, telephone, office, …)

Database collection and validation
• Checking recording quality
• Annotation: describing what people said, extra-speech sounds

Dictionaries
• Phonetic transcription of words
• Multiple transcriptions needed
• G2P: automatic transcription
[Figure: stream of feature vectors, e.g. (10.3, 1.2, -0.9, …, 0.2), (2.1, -0.2, 1.9, …, -0.3), …, (8.1, -0.5, 1.3, …, 0.2)]
Example: discrete models
A collection of prototypes is constructed (100 to 250)
Each vector is replaced by its nearest prototype
[Figure: scatter plot of feature vectors and the prototypes chosen to represent them]

[Figure: the utterance “friendly computers”: each 10 ms frame’s feature vector is replaced by its nearest prototype index, the frame stream is aligned with the phoneme string f r E n d l I , k O m p j u t $ r z, and prototype-phoneme co-occurrence counts (and percentages) are accumulated in a table]
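The quantization step can be sketched as a nearest-prototype search; the prototype set and vectors below are invented for illustration.

```python
def nearest_prototype(vector, prototypes):
    """Replace a feature vector by the index of its nearest prototype
    (squared Euclidean distance): the quantization step of a discrete model."""
    dists = [sum((v - p) ** 2 for v, p in zip(vector, proto))
             for proto in prototypes]
    return dists.index(min(dists))

protos = [(0.0, 0.0), (10.0, 1.0), (8.0, -0.5)]   # toy prototype set
stream = [(10.3, 1.2), (8.1, -0.5), (0.2, 0.1)]
codes = [nearest_prototype(v, protos) for v in stream]
# each frame is now a single prototype index instead of a full vector
```

A real system would use 100 to 250 prototypes, as the slide states, but the replacement rule is the same.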
Training of Acoustic Models
Create New Models
For all utterances in the database:
• Make a phonetic transcription of the sentence
• Use the models to segment the utterance file: assign a phoneme to each speech frame
• Collect statistical information: count prototype-phoneme occurrences
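The counting step can be sketched as follows, with a made-up frame-level alignment:

```python
from collections import defaultdict

def count_cooccurrences(prototype_stream, phoneme_stream):
    """Accumulate how often each prototype index is seen while a given
    phoneme is being spoken (one entry per 10 ms frame)."""
    counts = defaultdict(lambda: defaultdict(int))
    for proto, phoneme in zip(prototype_stream, phoneme_stream):
        counts[phoneme][proto] += 1
    return counts

# frames of 'friendly' aligned with phonemes (toy alignment)
protos = [2, 2, 7, 6, 9, 9]
phonemes = ["f", "f", "r", "r", "E", "E"]
table = count_cooccurrences(protos, phonemes)
```

Normalizing these counts per phoneme gives the discrete emission probabilities of the new acoustic model.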
Key Element in ASR
ASR is based on learning from observations
• Huge amounts of spoken data needed for making acoustic models
• Huge amounts of text data needed for making language models

=> Lots of statistics, few rules
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Contents of an ADB
Utterances of different utterance types
Utterance types suited to the intended application domain
Text balanced for phoneme and/or diphone distribution
All enriched with annotations
Contents of an ADB: Spontaneous v. Read Utterances

A spontaneous utterance is a response to a question or a request
• “In which city do you live?”
• “Please spell a letter heading to your secretary”
• “Is English your mother tongue?”
• “Make a hotel reservation”

A read utterance is an utterance read from a presented text
• “London”
• “Dear John”
• “Yes”
• “Please book me a room for 2 persons with bath. We will arrive …”
Contents of an ADB
Isolated phonetically rich word: Apple Tree, Lobster
Isolated digit: 5
Isolated alphabet letter: B
Isolated number (natural number): 4256
Contents of an ADB
Continuous digits: 9 1 1
Continuous alphabet: Y M C A
Commands: Stop, left, print, call, next
Contents of an ADB: Connected Digits

Telephone numbers: 057/228888
Credit card numbers: 3741 959289 310001
PIN codes: 8978
Social security number: 560228 561 80
Other identification numbers, e.g. sheet id: 012589225712
Contents of an ADB: Time and Date Expressions

Time (“analog”, word style): a quarter past two
Time (“digital”): 14:15, 2:15 PM
Date (“analog”, word style, absolute): Friday, June 25th, 1999; Christmas Eve, Easter
Date (“digital”, absolute): 25/06/99
Date (“analog”, word style, relative): tomorrow, next week, in one month
Contents of an ADB
Money amounts: $327.67, £148.95
Isolated phonetically rich sentences: A cold supper was ordered and a bottle of port
Isolated command sentences: Insert this name in the list
Names: Microsoft, New York, Jonathan
Syllables: Hi-ta-chi
Contents of an ADB
Continuous phonetically rich sentences: Once upon a time, in a land far from here, lived a little princess. She was the most beautiful girl…
Continuous command sentences: Select the first line. Make it bold and move it to the bottom of the text…
Continuous spontaneous speech: <Make a reservation in a hotel>
Contents of an ADB: Contents of SpeechDat-II ADB

For each speaker/session
• Approx. 40 utterances
• Duration approx. 10 minutes
• Mixture of read and spontaneous utterances
• Mixture of phonetically rich sentences, application-specific words, and utterance types that will often occur in any application

1000-5000 speakers/sessions
Contents of an ADB: Contents of SpeechDat-II ADB
1 isolated digit
4-digit id (sheet number)
3 connected digits (~10-digit telephone number)
12-digit credit card number
3 natural numbers
2 money amounts: 1 large, 1 small
3 spelled words
1 time of day (spontaneous)
Contents of an ADB: Contents of SpeechDat-II ADB
1 time phrase (read, word style)
1 date (spontaneous, e.g. person’s birthday)
2 dates (read, word style)
3 yes/no questions
1 city of call/birth
6 common application words out of 50
3 application word phrases
9 sentences (read)
Contents of an ADB: Contents of SpeechDat-Car ADB

For each session
• Approx. 120-130 utterances (depending on session)
• Duration 2-3 hours
• Mixture of read and spontaneous utterances
• Mixture of phonetically rich sentences, application-specific words, and utterance types that will often occur in any application

600 sessions with min. 300 speakers

In 2 out of 7 conditions
• Standing still / low speed / high speed
• Different road conditions / surrounding noise
• Audio equipment on/off
Contents of an ADB: Contents of SpeechDat-Car ADB

Digits and Digit Strings
• 1 sequence of 10 digits
• 1 sheet number (4+ digit sequence)
• 1 spontaneous telephone number
• 1 credit card number (16 digits)
• 1 PIN code (6 digits)
• 4 isolated digits

Dates
• 1 spontaneous date (e.g. birthday)
• 1 prompted date, word style
• 1 relative and general date expression
Contents of an ADB: Contents of SpeechDat-Car ADB

Names
• 1 spontaneous name (e.g. speaker’s first name)
• 1 city of growing up (spontaneous)
• 2 most frequent cities
• 2 company / agency / street names
• 1 person name (first name or surname)

Spellings
• 1 spontaneous spelled name (e.g. speaker’s first name)
• 1 spelling of a city name
• 4 real words or names
• 1 artificial name (for coverage)
Contents of an ADB: Contents of SpeechDat-Car ADB

Money Amounts / Natural Numbers
• 1 money amount
• 1 natural number

Times
• 1 time of day (spontaneous)
• 1 time phrase (word style)

Phonetically Rich Words
• 4 phonetically rich words
Contents of an ADB: Contents of SpeechDat-Car ADB

Application Words
• 13 mobile phone application words
• 22 IVR function keywords
• 32 car product keywords
• 2 voice activation keywords
• 2 language-dependent keywords

Sentences
• 2 phrases using an application word
• 9 phonetically rich sentences
• 10 prompts for spontaneous speech
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Phases in Acoustic Resource Creation
Design
Creation of Script
Recruitment & Recording
Annotation and Validation
Lexicon
Quality control
Production
DESIGN
Language study:
• Phoneme set
• Dialects
Scripting: utterance definition and distribution over speakers

Speaker typology: distribution definition
• Sex/gender, age, dialect, educational level
Recording: specification of procedure and platform
Validation: specification of procedure and quality standard
Creation of Script: Prompts/Text/Transcription
A prompt refers to the way an utterance is presented to the speaker. This can be done on the desktop, on paper or with a play back file (telephony).
The (presentation) text represents the utterance as it should be pronounced by the speaker. It is normally presented according to the spelling conventions of the target language.
The transcription is the utterance as it has been pronounced by the speaker.
EXAMPLE: the pronunciation of a digit string
• PROMPT: “Please read the number on top of your form”
• TEXT: 578124
• TRANSCRIPTION: five seven eight one two four
Creation of Script
Collect and clean text corpora

Split the cleaned text into a sequence of sentences
• Remove ungrammatical and overly long sentences
• Remove sentences containing offensive language
• Remove (certain) ambiguities in pronunciation: numbers, dates, abbreviations, etc.

Apply phonetic balancing tools to obtain phonetically rich text
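One common way such balancing tools work is greedy set covering; this sketch assumes that approach (the actual tools may differ) and uses letters as stand-in phonemes.

```python
def greedy_balance(sentences, phonemes_of, target):
    """Greedy script selection: repeatedly pick the sentence that covers the
    most not-yet-covered phonemes until the target set is covered."""
    covered, chosen, pool = set(), [], list(sentences)
    while not target <= covered and pool:
        best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
        if not (phonemes_of(best) - covered):
            break                      # no remaining sentence helps
        chosen.append(best)
        covered |= phonemes_of(best)
        pool.remove(best)
    return chosen

# toy example: treat letters as "phonemes"
script = greedy_balance(["ab", "bc", "cd"], set, {"a", "b", "c", "d"})
```

The same greedy idea extends to balancing diphone or triphone counts rather than a plain coverage set.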
Creation of Script
Collect and/or create other utterance types
• Telephone numbers, amounts, credit card numbers, etc.

Create prompts
• Prerecorded messages to the speaker
• For unmonitored recording without access to a screen (telephony)

Put all these in resource files
Creation of Script: Script File

Configuration
• Acquisition board, coding type
• Sampling rate, number of channels

Information items
• Speaker id, sheet id
• Gender, age, region of birth, region of youth, region of living, etc. and their possible values
• Recording environment/conditions

Sentence definitions
• Specifies order and types of utterances in one session
Creation of Script
Resource files => utterance sheets

Generate a letter with instructions and a list of utterances for each speaker (esp. telephony)
Creation of Script: Tools
Script Editor
• For creating/modifying scripts
• For creating utterance sheet files (from resource files)
• For generating letters to speakers
Digit String Generator
• Natural numbers
• Bank accounts
• Credit card numbers
• Phone numbers
• Pin-codes
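A generator along these lines might look like the sketch below. The Luhn checksum used for the credit card numbers is the standard one that real card numbers satisfy; everything else (lengths, interface) is an assumption.

```python
import random

def digit_string(length, seed=None):
    """Generate a random digit utterance (e.g. a PIN code or sheet id)."""
    rng = random.Random(seed)          # seedable for reproducible scripts
    return "".join(str(rng.randrange(10)) for _ in range(length))

def credit_card_number(seed=None):
    """16-digit string whose last digit makes the Luhn checksum valid,
    as real credit card numbers do."""
    body = digit_string(15, seed)
    total = 0
    for i, ch in enumerate(reversed(body)):
        d = int(ch)
        if i % 2 == 0:                 # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return body + str((10 - total % 10) % 10)
```

Generating valid check digits keeps the read material realistic without using anyone's actual card number.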
Creation of Script
Test the script
• By making one or more recording sessions
• Also tests the recording set-up
• Also provides an indication of average session duration
RECRUITMENT
Contact potential speakers according to the typology
• Acquaintances, colleagues
• Advertisements
• Employees/students of cooperating organizations (companies, universities)
• Possibly with the help of marketing agencies
Explain
• Purpose and context
• What the speaker is supposed to do
• How much time it will take
• Reimbursement for the speaker (time spent, travel costs)
Make concrete arrangements with the speakers
Locations: fitting the environment definition
Set-up: recording platform
Interview: log speaker typology & recording conditions
Instructions: what we expect from a speaker
Recording: follow-up on quality
RECORDING
Locations: set up recording equipment in an environment fitting the environment definition

Set up the recording platform and test it
Welcome speaker, instruct speaker
Interview : log speaker typology & recording conditions
Make recordings and follow-up on quality
Deal with administrative matters
• Agreement on ownership of recording
• Reimbursement
RECORDING TOOL
VALIDATION and ANNOTATION
After recording, the signal is NEVER touched
• Only enriched with annotations

Check (and correct) the relation between text & speech
• The orthographic transcription must represent what the speaker said
• Tool to expand abbreviations, numbers, digit sequences

Segmentation
• Check (and correct) begin and end of speech markers
• (Mostly for TTS) Mark begin and end of phonemes
VALIDATION and ANNOTATION
Assign a quality label
• Very good overall quality … very bad overall quality
Annotations for extra events
• Speaker sounds (coughing, breathing, swallowing, …)
• Mispronunciations, truncations
• Sound from other sources (other speaker, music, radio, …)
• Continuous background noise (wind, rain, …)
• Filled pauses (uh, um, er, ah, hmm, ….)
• Telephone distortions
Validation Tool
Semi-Automatic Validation
Validation can be partially automated
• For certain types of databases
• 70-75% reliably validated automatically
• 25-30% require a manual check
• Using ASR systems

Research into further automating this is ongoing
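Such a pipeline could be sketched as a simple routing policy; the exact-match rule and the confidence threshold are assumptions for illustration, not the projects' actual criteria.

```python
def auto_validate(prompt_text, asr_hypothesis, confidence, threshold=0.9):
    """Route one recording: auto-accept when the recognizer's hypothesis
    matches the prompt text with high confidence; otherwise send it to a
    human validator."""
    def norm(s):
        # case- and whitespace-insensitive comparison of transcriptions
        return " ".join(s.lower().split())
    if norm(prompt_text) == norm(asr_hypothesis) and confidence >= threshold:
        return "validated"
    return "manual check"
```

Tightening or loosening the threshold trades off the 70-75% automatic share against the risk of accepting bad recordings.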
LEXICON
One central “mother lexicon” for each language
• To reduce duplication of effort
• To maintain consistency
A request is compared with the mother database
• Unfound entries are imported into the mother database
• Unfound entries are turned into a job
• The job is assigned to linguists
LEXICON
After finishing the job
• Requested entries and properties are exported
• Turned into the required format
• Delivered to the requestor

Additions/modifications due to this request are now available for other requests
LEXICON: Tools
Phoned
• Lexical database plus user interface
• (currently in Access but switching to SQL Server)
• Reuse of G2P and Synthesis Modules
PhonedAdmin
• import and export of data from the mother database
• Comparison with existing mother database
• Definition of users and jobs
• Assignment of jobs to users
QUALITY CONTROL
Typical circumstances
• Database project is ongoing, often at a remote location
• Multiple persons involved (for recording and validation)
• Many questions, problems and unclarities arise constantly, requiring answers from specialists
• Danger of errors and inconsistencies, within the work of a single person and between different persons

Constant monitoring
• Systematic and regular quality checks required
• Systematic and regular feedback required
• During the whole project, from the earliest moment possible
• Documentation, incl. spot-check report
QUALITY CONTROL
Tools
• ADB Scanner: checks consistency of the database
  • Standard structure, all files available
• ADB Statistics
  • Statistics on information items (sex/gender, age, dialect, quality, etc.) and utterance types
• ADB Report Tool
  • For creating parts of the documentation
• And others
PRODUCTION
Huge amount of data!
Multiple copies needed
Special fast CD-replicator equipment
Special cupboards for storing the CDs
Description in catalogue
Distribution
Conversion tools (format converter, down-sampling, demultiplexing)
DAR Resource Description
DAR Resource Description: Statistics
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
General
More data!
The right data!
High Quality data
In-service Data
ASAP
SpeechDat Family
Consortium of industrial and university partners
Often EU projects
One type of database is defined
Each partner makes one database according to spec
Each database is validated by external organization (SPEX, Nijmegen, the Netherlands)
After approval databases are exchanged among the partners
Max. 1-1.5 yr later data are offered for distribution by ELRA
http://www.speechdat.org/
Overview of major projects
SpeechDat (M)
SpeechDat-II
SpeechDat-E
SpeechDat-Car
SPEECON
SALA I
SALA II
SpeechDat (M)
EU-funded
production, standardization, evaluation and dissemination of Spoken Language Resources
8 fixed telephone network databases, 1000 speakers each; 1 mobile telephone network database, 300 speakers
Period: 1994-1996
SpeechDat (M): Partners
Siemens
Philips
Vocalis
CSELT
UPC
IDIAP
INESC
GEC MSIS
SpeechDat (M): Languages
German
French
Danish
Italian
Spanish
Portuguese
Swiss French
SpeechDat-II
EU-funded
Creation of Telephony Databases
25 fixed and mobile telephone network databases, 500-5000 speakers each; 3 speaker verification databases
Period: 1996-1998
SpeechDat-II: Partners
Aalborg University
Auditex
British Telecom
CSELT
DMI
ELRA
GEC
GPT
IDIAP
INESC
Knowledge S.A.
KTH
Lernout & Hauspie
Matra Nortel
Philips
Portugal Telecom
Siemens
SPEX
Swiss Telecom
Telenor
Univ. of Maribor
Univ. of Munich
Univ. of Patras
UPC
Vocalis
SpeechDat-II: Languages
Danish
Flemish
Belgian French
Luxemburg German
Luxemburg French
British English
Welsh
Finnish
Finnish Swedish
French French
Dutch
Swiss French
Swiss German
German
Slovenian
Greek
Italian
Portuguese
Spanish
Swedish
Norwegian
SpeechDat-E
EU-funded: Eastern European Speech Databases for Creation of Voice Driven Teleservices

Speech databases for fixed telephone networks suited for typical present-day teleservices, plus a phonetically rich set of material for vocabulary-independent ASR
1000 – 2500 speakers
Period: 1999-2001
SpeechDat-E: Partners
Auditex
Lernout & Hauspie
Philips Speech Processing
Siemens
ELRA
SPEX
Brno University of Technology
Prague Technical University
Budapest University of Technology
Wroclaw University of Technology
Slovak Academy of Sciences
SpeechDat-E: Languages
Russian (2500)
Czech
Slovak
Hungarian
Polish
SpeechDat-Car
EU-funded
9 in-vehicle and mobile telephone network databases
300 speakers, each in 2 out of 7 conditions (600 recording sessions)
5 simultaneous channels
Period: Apr 1998 - Oct 2000
SpeechDat-Car: Partners
Aalborg University
Alcatel
Robert Bosch GmbH
DMI
ELRA
Knowledge S.A.
Lernout & Hauspie
L&H France (formerly Matra Nortel)
Nokia
Renault
SEAT
SPEX
University of Munich
UPC
Vocalis
Volkswagen
SpeechDat-Car: Languages
Danish
British English
Finnish
Flemish/Dutch
French
German
Greek
Italian
Spanish
American English
SPEECON
Speech driven interfaces for consumer devices
Speech databases for voice-controlled consumer applications
• Television sets, video recorders, mobile phones, palmtop computers, car navigation kits or even microwave ovens and toasters
600 speakers
Period: 2000-2003
SPEECON: Partners
DaimlerChrysler
Ericsson
IBM
Lernout & Hauspie
Natural Speech Communications
Nokia
Philips Speech Processing
Siemens
Sony
Temic Telefunken
SPEECON: Languages
EU Spanish
Russian
Italian
Swedish
German
UK English
Danish
Flemish
US English
US Spanish
Hebrew
French
Finnish
Mandarin
Dutch
Japanese
Polish
Portuguese
Swiss German
Cantonese
SALA I
SpeechDat Across Latin America
Not government-subsidized
Speech databases for fixed telephony, Latin America
1000-2000 speakers per database
Period: 1998-2001
SALA: Partners
CSELT
ELRA
Lernout & Hauspie
Lucent
Philips
Siemens
SPEX
UPC
Vocalis
SALA: Languages
Brazil (Portuguese, 2000)
Mexico (2000)
Caribbean islands and Venezuela
Central America
Panama, Colombia
Ecuador, Peru, Bolivia
Chile
Argentina, Uruguay, Paraguay
SALA II
Not government-subsidized
To create speech databases for cellular-telephone-oriented applications
America (North and Latin)
1000 (or 2000) speakers
Period: 2001-2002
(project just starting up)
SALA II Partners
ATLAS
ELRA
IBM
Lernout&Hauspie
Loquendo
Lucent
NSC
Philips
Siemens
SPEX
UPC
SALA II Languages
Venezuela
Peru
Mexico
Chile
Argentina
Costa Rica
Brazil
Colombia
American English Canada
US English North East
US Spanish East
US English South East
US Spanish West
US English North West
Future
Non-native/multilingual ASR
Data for Speech-to-Speech Translation
Access to information
• Anytime
• Anywhere
• By way of any device
More use of spontaneous speech (“conversational systems”)
Future
Devices will become
• Increasingly smaller (“mobile”)
• Increasingly more powerful
• Connected to information sources such as the Internet, etc.
=> Robustness against different environments needed

Input/Output
• Limited
• Keyboards and screens less convenient
• Opportunity for speech input and output
• Other input/output methods get different roles
=> Multi-modal input and output systems
Future
Distributed systems
• Part of the recognition/synthesis on the local system (“client”)
• Part on the server
• Dynamically adaptable local systems

In the car, speech is a
• “Hands-free” and
• “Eyes-free” solution
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Why a Speech User Interface?

Pro
• Audio feedback draws attention
• Complex commands, e.g. control your VCR
• Fast and simple (Chinese!)
  • Speech input: 50-250 wpm
  • Typing: 20-90 wpm
  • Handwriting: 25 wpm
  • Pointing: 10-40/min
• Eyes free
• Hands free
• Mobile
• Compact I/O devices

Con
• Audio messages are difficult to remember if too long, e.g. a telephone number or address
• “A drawing can replace a thousand words”
• Privacy
• Sometimes cumbersome, e.g. controlling a cursor on a screen
• Voice wear-out
Text-to-Speech engines
[Figure: TTS engines positioned by processor power & memory (low → high) and voice quality (machine-like → human-like): TTS2500, TTS3000, RealSpeak UltraCompact, RealSpeak Compact, RealSpeak]
Text-to-Speech engines
TTS2500
• Low-quality, small-footprint engine for talking dictionary products
• Available, no additional R&D

TTS3000
• Medium-quality engines
• Limited footprint, high densities
• Limited developments

RealSpeak Compact
• Target: handheld devices

RealSpeak
• High-end system
RealSpeak TTS
New generation, human sounding TTS
Target: server based telephony, PCMM
Platform requirements:
• CPU: 48 real-time instances on a PIII 450 MHz (8 kHz speech data)
• RAM: < 250 kB/instance; ROM: 4-6 MB

Speechbases:
• 8 kHz uncompressed: ~250 MB
• 8, 11 kHz compressed: 20-30 MB
• 22 kHz compressed: 70-90 MB
20 languages: US English, 15 European and 4 Asian languages
2 languages under development
RealSpeak Compact
High quality, medium footprint TTS
Target: mobile and embedded platforms
Platform requirements:
• 150 MIPS
• RAM: < 250 kB/instance; 4-6 MB common
• ROM: 16 MB (includes 11 kHz speechbase)
Derived automatically from RealSpeak
RealSpeak ultra compact under development
TTS3000
Low footprint, highly intelligible TTS engine
Target: Telephony, PCMM, Mobile, Embedded
Platform requirements:
• CPU: 20-30 MIPS
• RAM: 100 kB/instance; ROM: 2-3 MB

13 languages including:
• US English
• 7 European languages
• 3 Asian languages
2 languages under development
TTS2500
Dedicated TTS for very low footprint talking dictionaries
Analysis on 8 or 16 bit processor: <2 Mips
Synthesis on a dedicated chip (LH3010 or LH3030) or DSP (ADSP21xx)

1.5 MB ROM, 16 KB RAM

Languages:
• American English
• Mandarin Chinese
• Mexican Spanish
• German
• French
Dimensions of ASR
Speaker
• Independent - adaptive - dependent
• Native - non-native
• Man, woman, child
Recording conditions
• Recording device: telephone, GSM, microphone, tape recorder
• Environment: quiet office, home, car, factory, street…
Implementation
• Platform: PC, embedded
• CPU and memory
Dimensions of ASR
Size of the (active) vocabulary
• Small (10-100) - medium (100-1000) - large (>1000) - very large (>10000)
Flexibility of the vocabulary
• Fixed (factory-definable) - user-definable
• Phoneme-based => speaker-independent
• User words => speaker-dependent
Word sequences
• Isolated words - sentences - word spotting
• Fixed grammar - flexible language model
• Discrete - continuous speech
Language
• Language-independent engine, language-dependent data files
• Swapping language files
Different Applications, Different Needs
Dictation
• Speaker-dependent, large vocabulary, continuous speech, quiet office, PC

Command & control, name dialing
• Speaker-independent, small to large vocabulary, noise robust, DSP boards and/or client-server

Dialogue systems
• Speaker-independent, medium to large vocabulary, noise robust, client-server

Security: verification
• Speaker-dependent, combination of password (what) + speaker characteristics (who)

Language learning
• Non-native speakers, punish mistakes rather than being tolerant
Automatic Speech Recognition
L&H speech recognition engines cover a broad range of tasks, processor types, operating systems and input signal types:
Tasks:
• Large-vocabulary continuous real-time dictation
• Large-vocabulary batch transcription
• Grammar-based recognition: large, medium and small vocabularies
• Small-vocabulary isolated word recognition
Platforms:
• PC
• Server
• Handheld, embedded
• Distributed
Automatic Speech Recognition engines
[Figure: engines positioned by processor power & memory (low → high) and task complexity: ASR100 (isolated word recognition, mobile terminal) → ASR300 → ASR1500/ASR1600 (medium/large vocabulary, closed grammar) → XCalibur/MRec/VX (large vocabulary, open-grammar dictation, server)]
Recognition engines …
Input conditions:
• Environments: home, office, public/industrial, car
• Channels: telephone (wireline, wireless), wideband, mobile devices
• Microphones: close-talking, far-talking
• Combinations: e.g., broadcast material
A wide range of processor/memory operating points:
• 200 Mips / 32 MB
• 60 Mips / 1 MB
• 20 Mips / 300 KB
• 5-10 Mips / < 30 KB
Recognition engines ….
ASR100:
• 5-10 Mips
• < 30 KB
• Speaker-dependent
• Recording device: mic./phone
• Sampling frequency: 8/11 kHz
• Environment: office
• Vocabulary: small and user-adaptable
• Grammar: isolated
• Speech: isolated
• OS: various
• Architecture: stand-alone
• Languages: language-independent

Applications
• Embedded
• Cell-phone dialing
• Toys
Recognition engines ….
ASR300:
• 20 Mips
• 300 KB
• SI & SD
• Sampling frequency: 8/11 kHz
• Vocabulary: small and factory-adaptable
• Highly noise-robust
• Environment: office/car/other noisy environments
• Unit: word-dependent
• Grammar: isolated
• Speech: quasi-connected command and control
• OS: various
• Architecture: stand-alone
• Languages: US English, French, Italian, Korean, German, Japanese

Applications
• In-car command and control
• Command and control of toys, games
• Command and control in noisy industrial environments
Recognition engines ….
ASR1500
• 60 Mips
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 8 kHz
• Environment: office
• Recording device: telephone / mobile phone
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• IVR applications over the phone
  • Reverse directory, automated attendant
  • Information providers (stock quotes)
  • Ordering systems
• ASR1600: the highly noise-robust variant
Recognition engines ….
ASR1600
• 60 Mips
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 11 kHz
• Environment: office, car; highly noise-robust
• Recording device: mic.
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• In-car recognition
  • Command and control
• Embedded devices
  • PDAs, SmartPhones
Recognition engines ….
Mrec/VX:
• > 200 Mips
• > 64 MB
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office
• Recording device: mic.
• Grammar: statistical
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone
• Languages: US English and Spanish, 7 European languages, 2 Asian languages

Applications
• Document creation, incl. command and control
• MediaIndexer (Mrec)
• Speech Transcription (Mrec)
Recognition engines ….
Xcalibur
• Scalable
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office (telephony, car)
• Recording device: mic.
• Grammar: statistical and rule-based
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone and client-server
• Languages: currently only Japanese

Applications
• Document creation
• Command and control
• Focus on conversational systems