
Page 1: Acoustic Databases

Acoustic Databases

Jan Odijk

ELSNET Summer School, Prague, 2001

Page 2: Acoustic Databases

Acknowledgements

Part of the slides have been borrowed from or are based on work by
• Bart D’Hoore
• Hugo van Hamme
• Robrecht Comeyne
• Dirk van Compernolle
• Bert van Coile

Page 3: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 4: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 5: Acoustic Databases

Linguistic Resources (LRs)

Linguistic Resources are sets of language data in machine readable form that can be used for developing, improving or evaluating language and speech technologies.

Some language and speech technologies
• Text-To-Speech (TTS)
• Automatic Speech Recognition (ASR)
• Dictation
• Speaker Verification/Recognition
• Spoken Dialogue
• Audio Mining
• Machine Translation
• Intelligent Content Management
• …

Page 6: Acoustic Databases

Linguistic Resources (LRs): Major Types

Electronic Text Corpora
• Newspapers, magazines, etc.
• Usenet texts, e-mail, correspondence
• Etc.

Lexical Resources
• Monolingual lexicons
• Translation lexicons
• Thesauri
• …

Acoustic Resources
• Annotated speech recordings
• Annotated recordings of other acoustic signals
  • Coughing, throat clearing, breathing, …
  • Door slamming, screeching tires (of a car), …

Page 7: Acoustic Databases

Types of Linguistic Resources: Acoustic Resources

Acoustic Databases (ADBs)
• Controlled recordings of human speech or other acoustic signals
• Enriched with annotations
• Recorded digitally
• Representative of the targeted application environment and medium
• Balanced for phonemes/phoneme combinations
• Speaker parameters, recording quality, environment/medium documented

Page 8: Acoustic Databases

Types of Linguistic Resources: Acoustic Resources

Annotated unstructured recordings
• Broadcast material
• Recorded conversations/monologues/speeches, etc.
• Dictated material
• Enriched with annotations

Page 9: Acoustic Databases

Types of Linguistic Resources: Acoustic Resources

In-service data
• Recorded sessions of humans interacting with the running application
• Usually obtained by logging a customer system
• Enriched with annotations
• Used for tuning models, grammars, etc. to the specific application

Page 10: Acoustic Databases

Types of Linguistic Resources: Acoustic Resources

Environments
• “Quiet”
  • Studio
  • Quiet office
  • Normal office
• Noisy
  • Public place (street, hotel lobby, station, etc.)
  • Car (engine running at 0 km/h, city, highway)
  • Industrial environment

Page 11: Acoustic Databases

Types of Linguistic Resources: Acoustic Resources

Media
• HQ close-talk microphone
• Desktop microphones
• Telephone
  • analog or digital
  • fixed line or mobile
• Wide-band microphones
• Array microphones
• PC/PDA etc. low-quality microphones

Page 12: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 13: Acoustic Databases

Acoustic Resources: Use

(for speech synthesis modules in TTS systems)

(as acoustic reference material for pronunciation lexicons)

Mainly for speech recognition

Training and test material for research into new recognition engines and engine features

Training and test material for development of acoustic models

Tuning of acoustic models for specific applications

Page 14: Acoustic Databases

What is speech recognition?

ASR: Automatic speech recognition

Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.

Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.

Speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples.

Speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.

Page 15: Acoustic Databases

Elements of a Recognizer

[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and the Language Model) → Post Processing → display text; Natural Language Understanding then maps the text to meaning and action]

Page 16: Acoustic Databases

Elements of a Recognizer

[Diagram: elements of a recognizer, repeated from page 15]

Page 17: Acoustic Databases

Feature Extraction

Turning the speech signal into something more manageable
• Do an analysis once every 10 ms
• Data compression: 220 bytes => 50 bytes => 4 bytes

Sampling of a signal: transforming into a digital form

Extracting relevant parameters from the signal
• Spectral information, energy, pitch, ...

Eliminate undesirable elements (normalization)
• Noise
• Channel properties
• Speaker properties (gender)

Page 18: Acoustic Databases

Feature Extraction: Vectors

The signal is chopped into small pieces (frames), typically 30 ms

Spectral analysis of a speech frame produces a vector representing the signal properties.

=> result = stream of vectors

[Figure: a waveform is split into overlapping frames; each frame is analyzed into a feature vector such as (10.3, 1.2, -0.9, …, 0.2)]
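To make the framing concrete, here is a minimal Python/NumPy sketch. The 30 ms frames and 10 ms analysis step follow the slides; the Hamming window, log-spectrum features and the n_coeffs truncation are simplifying assumptions standing in for a real front end such as MFCC analysis.

    import numpy as np

    def extract_features(signal, rate=16000, frame_ms=30, hop_ms=10, n_coeffs=12):
        # Chop the signal into overlapping frames: one analysis every 10 ms.
        frame_len = int(rate * frame_ms / 1000)
        hop_len = int(rate * hop_ms / 1000)
        vectors = []
        for start in range(0, len(signal) - frame_len + 1, hop_len):
            frame = signal[start:start + frame_len] * np.hamming(frame_len)
            spectrum = np.abs(np.fft.rfft(frame))   # spectral information
            log_spec = np.log(spectrum + 1e-10)     # compress the dynamic range
            vectors.append(log_spec[:n_coeffs])     # keep a few coarse coefficients
        return np.array(vectors)                    # the stream of vectors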

Page 19: Acoustic Databases

Elements of a Recognizer

[Diagram: elements of a recognizer, repeated from page 15]

Page 20: Acoustic Databases

Acoustic Model

Split utterance into basic units, e.g. phonemes

The acoustic model describes the typical spectral shape (or typical vectors) for each unit

For each incoming speech segment, the acoustic model will tell us how well (or how badly) it matches each phoneme

Must cope with pronunciation variability
• Utterances of the same word by the same speaker are never identical
• Differences between speakers
• Identical phonemes sound different in different words

=> statistical techniques: models are created from a large number of examples

Page 21: Acoustic Databases

[Figure: HMM state sequence S1…S13 aligned with the phonemes of “friendly computers”]

Page 22: Acoustic Databases

Acoustic Model: Units

Word: a series of units specific to the word

Phoneme: words share units that model the same sound

[Figure: state sequences for the words “stop” and “start”, first as word-specific unit chains, then built from shared phoneme units S-T-O-P and S-T-A-R-T]

Page 23: Acoustic Databases

Acoustic Model: Units

Context-dependent phoneme

Diphone

Other sub-word units: consonant clusters

[Figure: the word “stop” segmented as context-dependent phonemes (e.g. S|,|T: S preceded by silence, followed by T), as diphones, and as a consonant cluster plus phonemes (ST O P)]

Page 24: Acoustic Databases

Acoustic Model: Units

Phonemes

Phonemes in context: spectral properties depend on previous and following phoneme

Diphones

Sub-words: syllables, consonant clusters

Words

Multi-words, for example “it is”, “going to”

Combinations of all of the above

Page 25: Acoustic Databases

Elements of a Recognizer

[Diagram: elements of a recognizer, repeated from page 15]

Page 26: Acoustic Databases

Pattern matching

Acoustic Model: returns a score for each incoming feature vector indicating how well the feature corresponds to the model = the local score

Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi; see the sketch below)

Search algorithm: looks for the best-scoring word or word sequence
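A minimal sketch of the Viterbi scoring step, assuming the acoustic model has already delivered per-frame local log-scores for each state of the word model, and that state transitions are given as log-probabilities (both arrays are hypothetical inputs):

    import numpy as np

    def viterbi_score(local_scores, trans):
        # local_scores[t, s]: log-score of frame t under state s (acoustic model)
        # trans[s, s2]: log-probability of moving from state s to state s2
        T, S = local_scores.shape
        best = np.full(S, -np.inf)
        best[0] = local_scores[0, 0]   # paths must start in the first state
        for t in range(1, T):
            # keep the best predecessor for each state, then add the local score
            best = local_scores[t] + np.max(best[:, None] + trans, axis=0)
        return best[-1]                # best path ending in the final state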

Pages 27-30: Acoustic Databases (image-only slides)
Page 31: Acoustic Databases

Elements of a Recognizer

[Diagram: elements of a recognizer, repeated from page 15]

Page 32: Acoustic Databases

Language Model

Describes how words are connected to form a sentence

Limit possible word sequences

Reduce number of recognition errors by eliminating unlikely sequences

Increase speed of recognizer => real time implementations

Page 33: Acoustic Databases

Language Model

Two major types
• Grammar-based

!start <sentence>;

<sentence>: <yes> | <no>;

<yes>: yes | yep | yes please ;

<no>: no | no thanks | no thank you ;

• Statistical
  • Probability of single words and of 2/3-word sequences
  • Derived from frequencies in a large corpus (see the bigram sketch below)
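A minimal sketch of the statistical variant: bigram probabilities estimated from frequencies in a corpus. The sentence markers and Laplace smoothing are common conveniences, not something the slides prescribe.

    from collections import Counter

    def train_bigram(sentences):
        # Count word and word-pair frequencies in a corpus of plain-text sentences.
        unigrams, bigrams = Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            unigrams.update(words[:-1])            # history counts
            bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def bigram_prob(w1, w2, unigrams, bigrams, alpha=1.0):
        # P(w2 | w1) with Laplace smoothing, so unseen pairs stay non-zero.
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * len(unigrams))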

Page 34: Acoustic Databases

Active Vocabulary

Lists the words that can be recognized by the acoustic model and that are allowed to occur given the language model

Each word is associated with a phonetic transcription
• Enumerated, and/or
• Generated by a Grapheme-to-Phoneme (G2P) module (see the lookup sketch below)
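A minimal lookup sketch combining the two options; lexicon and g2p are hypothetical stand-ins for an enumerated pronunciation dictionary and a G2P module.

    def phonetic_transcriptions(word, lexicon, g2p):
        # Enumerated transcriptions take precedence; fall back to G2P otherwise.
        return lexicon.get(word) or [g2p(word)]

    lexicon = {"yes": ["j E s"], "no": ["n @U"]}   # toy, SAMPA-like entries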

Page 35: Acoustic Databases

Post Processing

Re-ordering of the N-best list using other criteria: e.g. account numbers, telephone numbers

Spelling: name search from a list of known names

Applying NLP techniques that fall outside the scope of the statistical language model
• E.g. “three dollars fifty cents” => “$ 3.50”
• “doctor Jones” => “Dr. Jones”
• Etc.
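A toy sketch of such a rewrite, using the examples above; the word-to-digit map is only a fragment, and a real post-processor would use a proper number grammar rather than regular expressions.

    import re

    WORD_TO_DIGIT = {"three": "3", "fifty": "50"}   # fragment of a full mapping

    REWRITES = [
        (re.compile(r"\b(\d+) dollars (\d+) cents\b"), r"$ \1.\2"),
        (re.compile(r"\bdoctor\b", re.IGNORECASE), "Dr."),
    ]

    def postprocess(text):
        for word, digit in WORD_TO_DIGIT.items():
            text = re.sub(rf"\b{word}\b", digit, text)
        for pattern, repl in REWRITES:
            text = pattern.sub(repl, text)
        return text

    print(postprocess("three dollars fifty cents"))   # -> "$ 3.50"
    print(postprocess("doctor Jones"))                # -> "Dr. Jones"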

Page 36: Acoustic Databases

Training of Acoustic Models

[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]

Page 37: Acoustic Databases

Training of Acoustic Models

Database design
• Coverage of units: word, phoneme, context-dependent unit
• Coverage of population (region, dialect, age, …)
• Coverage of environments (car, telephone, office, …)

Database collection and validation
• Checking recording quality
• Annotation: describing what people said, extra-speech sounds

Dictionaries
• Phonetic transcription of words
• Multiple transcriptions needed
• G2P: automatic transcription

Page 38: Acoustic Databases

[Figure: a stream of feature vectors, e.g. (10.3, 1.2, -0.9, …, 0.2), (2.1, -0.2, 1.9, …, -0.3), …, (8.1, -0.5, 1.3, …, 0.2)]

Page 39: Acoustic Databases

Example: discrete models

A collection of prototypes is constructed (100 to 250)

Each vector is replaced by its nearest prototype

[Figure: scatter plot of feature vectors together with the prototypes that represent them]
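A minimal sketch of the quantization step, assuming NumPy arrays; the prototypes themselves would be trained beforehand, e.g. with k-means.

    import numpy as np

    def quantize(vectors, prototypes):
        # Distance from every feature vector to every prototype (Euclidean),
        # then replace each vector by the index of its nearest prototype.
        dists = np.linalg.norm(vectors[:, None, :] - prototypes[None, :, :], axis=2)
        return np.argmin(dists, axis=1)   # one prototype index per frame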

Page 40: Acoustic Databases

[Figure: the stream of feature vectors replaced by a stream of prototype indices, e.g. … 3 9 … 7 …]

Page 41: Acoustic Databases

[Figure: the frames of “friendly computers” (f r E n d l I k O m p j u t $ z) with, per frame, the assigned prototype index; below, a matrix counting how often each prototype co-occurs with each phoneme]

Page 42: Acoustic Databases

Training of Acoustic Models

Create new models

For all utterances in the database:

Make a phonetic transcription of the sentence

Use the current models to segment the utterance file: assign a phoneme to each speech frame

Collect statistical information: count prototype-phoneme occurrences (see the sketch below)
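A minimal sketch of the counting step; the (prototype_ids, phoneme_labels) pairs are a hypothetical per-frame representation of the segmentation output.

    from collections import Counter

    def count_cooccurrences(segmented_utterances):
        # Tally how often each prototype occurs inside each phoneme.
        counts = Counter()
        for proto_ids, phonemes in segmented_utterances:
            counts.update(zip(proto_ids, phonemes))
        return counts   # counts[(p, ph)]: occurrences of prototype p in phoneme ph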

Page 43: Acoustic Databases

Key Element in ASR

ASR is based on learning from observations
• Huge amounts of spoken data needed for making acoustic models
• Huge amounts of text data needed for making language models

=> Lots of statistics, few rules

Page 44: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 45: Acoustic Databases

Contents of an ADB

Utterances of different utterance types

Utterance types suited to the intended application domain

Text balanced for phoneme and/or diphone distribution

All enriched with annotations

Page 46: Acoustic Databases

Contents of an ADB: Spontaneous v. Read Utterances

A spontaneous utterance is a response to a question or a request
 “In which city do you live?”
 “Please spell a letter heading to your secretary”
 “Is English your mother tongue?”
 “Make a hotel reservation”

A read utterance is an utterance read from a presentation text
 “London”
 “Dear John”
 “Yes”
 “Please book me a room for 2 persons with bath. We will arrive ….”

Page 47: Acoustic Databases

Contents of an ADB

Isolated phonetically rich word: Apple Tree, Lobster

Isolated digit: 5

Isolated alphabet letter: B

Isolated number (natural number): 4256

Page 48: Acoustic Databases

Contents of an ADB

Continuous digits: 9 1 1

Continuous alphabet letters: Y M C A

Commands: Stop, left, print, call, next

Page 49: Acoustic Databases

Contents of an ADB: Connected Digits

Telephone numbers: 057/228888

Credit card numbers: 3741 959289 310001

PIN codes: 8978

Social security numbers: 560228 561 80

Other identification numbers, e.g. sheet id: 012589225712

Page 50: Acoustic Databases

Contents of an ADB: Time and Date Expressions

Time (“analog”, word style): a quarter past two

Time (“digital”): 14:15, 2:15 PM

Date (“analog”, word style, absolute): Friday, June 25th, 1999; Christmas Eve; Easter

Date (“digital”, absolute): 25/06/99

Date (“analog”, word style, relative): tomorrow, next week, in one month

Page 51: Acoustic Databases

Contents of an ADB

Money amounts: $327.67, £148.95

Isolated phonetically rich sentences: A cold supper was ordered and a bottle of port

Isolated command sentences: Insert this name in the list

Names: Microsoft, New York, Jonathan

Syllables: Hi-ta-chi

Page 52: Acoustic Databases

Contents of an ADB

Continuous phonetically rich sentences: Once upon a time, in a land far from here, lived a little princess. She was the most beautiful girl…

Continuous command sentences: Select the first line. Make it bold and move it to the bottom of the text…

Continuous spontaneous speech: <Make a reservation in a hotel>

Page 53: Acoustic Databases

Contents of an ADB: SpeechDat-II ADB

For each speaker/session
 Approx. 40 utterances
 Duration approx. 10 minutes
 Mixture of read and spontaneous utterances
 Mixture of
  Phonetically rich sentences
  Application-specific words
  Utterance types that will often occur in any application

1000-5000 speakers/sessions

Page 54: Acoustic Databases

Contents of an ADB: SpeechDat-II ADB

1 isolated digit

4-digit id (sheet number)

3 connected digits (~10-digit telephone number)

12-digit credit card number

3 natural numbers

2 money amounts: 1 large, 1 small

3 spelled words

1 time of day (spontaneous)

Page 55: Acoustic Databases

Contents of an ADB: SpeechDat-II ADB

1 time phrase (read, word style)

1 date (spontaneous, e.g. person’s birthday)

2 dates (read, word style)

3 yes/no questions

1 city of call/birth

6 common application words out of 50

3 application word phrases

9 sentences (read)

Page 56: Acoustic Databases

Contents of an ADB: SpeechDat-Car ADB

For each session
 Approx. 120-130 utterances (depending on session)
 Duration 2-3 hours
 Mixture of read and spontaneous utterances
 Mixture of
  Phonetically rich sentences
  Application-specific words
  Utterance types that will often occur in any application

600 sessions with min. 300 speakers

In 2 out of 7 conditions
 Standing still / low speed / high speed
 Different road conditions / surrounding noise
 Audio equipment on/off

Page 57: Acoustic Databases

Contents of an ADB: SpeechDat-Car ADB

Digits and digit strings
 1 sequence of 10 digits
 1 sheet number (4+ digit sequence)
 1 spontaneous telephone number
 1 credit card number (16 digits)
 1 PIN code (6 digits)
 4 isolated digits

Dates
 1 spontaneous date (e.g. birthday)
 1 prompted date, word style
 1 relative and general date expression

Page 58: Acoustic Databases

Contents of an ADB: SpeechDat-Car ADB

Names
 1 spontaneous name (e.g. speaker’s first name)
 1 city of growing up (spontaneous)
 2 most frequent cities
 2 company / agency / street names
 1 person name (first name or surname)

Spellings
 1 spontaneous spelled name (e.g. speaker’s first name)
 1 spelling of a city name
 4 real words or names
 1 artificial name (for coverage)

Page 59: Acoustic Databases

Contents of an ADB: SpeechDat-Car ADB

Money amounts / natural numbers
 1 money amount
 1 natural number

Times
 1 time of day (spontaneous)
 1 time phrase (word style)

Phonetically rich words
 4 phonetically rich words

Page 60: Acoustic Databases

Contents of an ADB: SpeechDat-Car ADB

Application words
 13 mobile phone application words
 22 IVR function keywords
 32 car product keywords
 2 voice activation keywords
 2 language-dependent keywords

Sentences
 2 phrases using an application word
 9 phonetically rich sentences
 10 prompts for spontaneous speech

Page 61: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 62: Acoustic Databases

Phases in Acoustic Resource Creation

Design

Creation of Script

Recruitment & Recording

Annotation and Validation

Lexicon

Quality control

Production

Page 63: Acoustic Databases

DESIGN

Language study:

• Phoneme set

• Dialects

Scripting: utterance definition and distribution over speakers

Speaker typology: distribution definition

• Sex, gender, age, dialects, educational level

Recording: specification of procedure and platform

Validation: specification of procedure and quality standard

Page 64: Acoustic Databases

Creation of Script: Prompts/Text/Transcription

A prompt refers to the way an utterance is presented to the speaker. This can be done on the desktop, on paper, or with a playback file (telephony).

The (presentation) text represents the utterance as it should be pronounced by the speaker. It is normally presented according to the spelling conventions of the target language.

The transcription is the utterance as it has been pronounced by the speaker.

EXAMPLE: the pronunciation of a digit string
• PROMPT: “Please read the number on top of your form”
• TEXT: 578124
• TRANSCRIPTION: five seven eight one two four
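A minimal sketch of keeping the three notions apart in a database record, populated with the example above:

    from dataclasses import dataclass

    @dataclass
    class Utterance:
        prompt: str          # how the utterance was presented to the speaker
        text: str            # what the speaker was supposed to say
        transcription: str   # what the speaker actually said

    example = Utterance(
        prompt="Please read the number on top of your form",
        text="578124",
        transcription="five seven eight one two four",
    )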

 

Page 65: Acoustic Databases

Creation of Script

Collect and clean text corpora

Split the cleaned text into a sequence of sentences

Remove ungrammatical and overly long sentences

Remove sentences containing offensive language

Remove (certain) ambiguities in pronunciation
• numbers, dates, abbreviations, etc.

Apply phonetic balancing tools to obtain phonetically rich text (one selection strategy is sketched below)
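A minimal sketch of one simple balancing strategy (greedy set cover over phoneme types); phonemes_of is a hypothetical helper backed by a pronunciation lexicon, and real balancing tools typically also weight diphone frequencies.

    def select_balanced(sentences, phonemes_of, target_size):
        # Repeatedly pick the sentence that adds the most uncovered phonemes.
        covered, selected = set(), []
        pool = list(sentences)
        while pool and len(selected) < target_size:
            best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
            selected.append(best)
            covered |= phonemes_of(best)
            pool.remove(best)
        return selected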

Page 66: Acoustic Databases

Creation of Script

Collect and/or create other utterance types
• Telephone numbers, amounts, credit card numbers, etc.

Create prompts
• Prerecorded messages to the speaker
• For unmonitored recording without access to a screen (telephony)

Put all of these in resource files

Page 67: Acoustic Databases

Creation of Script: Script File

Configuration
• Acquisition board, coding type
• Sampling rate, number of channels

Information items
• Speaker id, sheet id
• Gender, age, region of birth, region of youth, region of living, etc., and their possible values
• Recording environment/conditions

Sentence definitions
• Specify the order and types of utterances in one session

Page 68: Acoustic Databases

Creation of Script

Resource files => utterance sheets

Generate a letter with instructions and a list of utterances for each speaker (esp. telephony)

Page 69: Acoustic Databases

Creation of Script: Tools

Script Editor

• For creating/modifying scripts

• For creating utterance sheet files (from resource files)

• For generating letters to speakers

Digit String Generator

• Natural numbers

• Bank accounts

• Credit card numbers

• Phone numbers

• Pin-codes

Page 70: Acoustic Databases

Creation of Script

Test the script
• By making one or more recording sessions
• Also tests the recording set-up
• Also provides an indication of the average session duration

Page 71: Acoustic Databases

RECRUITMENT

Contact potential speakers according to the typology
• Acquaintances, colleagues
• Advertisements
• Employees/students of cooperating organizations (companies, universities)
• Possibly with the help of marketing agencies

Explain
• Purpose and context
• What the speaker is supposed to do
• How much time it will take
• Reimbursement for the speaker (time spent, travel costs)

Make concrete arrangements with the speakers

Locations: fitting the environment definition

Set-up: recording platform

Interview: log speaker typology & recording conditions

Instructions: what we expect from a speaker

Recording: follow up on quality

Page 72: Acoustic Databases

RECORDING

Locations: set up recording equipment in an environment fitting the environment definition

Set-up recording platform and test it

Welcome speaker, instruct speaker

Interview : log speaker typology & recording conditions

Make recordings and follow-up on quality

Deal with administrative matters

• Agreement on ownership of recording

• Reimbursement

Page 73: Acoustic Databases

RECORDING TOOL

Page 74: Acoustic Databases

VALIDATION and ANNOTATION

After recording, the signal will NEVER be touched

• Only enriched with annotations

Check (and correct) the relation between text & speech

• Orthographic transcription must represent what the speaker said

• Tool to expand abbreviations, numbers, digit sequences

Segmentation

• Check (and correct) begin and end of speech markers

• (mostly for TTS) Mark begin and end of phonemes

Page 75: Acoustic Databases

VALIDATION and ANNOTATION

Assign a quality label

• Very good overall quality … very bad overall quality

Annotations for extra events

• Speaker sounds (coughing, breathing, swallowing, …)

• Mispronunciations, truncations

• Sound from other sources (other speaker, music, radio, …)

• Continuous background noise (wind, rain, …)

• Filled pauses (uh, um, er, ah, hmm, ….)

• Telephone distortions

Page 76: Acoustic Databases

Validation Tool

Page 77: Acoustic Databases

Semi-Automatic Validation

Validation can be partially automated

For certain types of databases
 70-75% reliably validated automatically
 25-30% require a manual check

Using ASR systems (a minimal sketch follows below)

Research into further automating this is ongoing
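A minimal sketch of the split, assuming utterance records with audio and transcription fields and a recognize function standing in for the ASR system:

    def auto_validate(utterances, recognize, normalize=str.lower):
        # Utterances whose recognition result matches the transcription are
        # accepted automatically; the rest are queued for a manual check.
        auto_ok, manual = [], []
        for utt in utterances:
            hyp = normalize(recognize(utt.audio))
            ref = normalize(utt.transcription)
            (auto_ok if hyp == ref else manual).append(utt)
        return auto_ok, manual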

Page 78: Acoustic Databases

LEXICON

One central “mother lexicon” for each language

• To reduce duplication of effort

• To maintain consistency

Requests are compared with the mother database
• Entries not found are imported into the mother database
• Entries not found are turned into a job
• The job is assigned to linguists

Page 79: Acoustic Databases

LEXICON

After finishing the job
• Requested entries and properties are exported
• Turned into the required format
• Delivered to the requestor

Additions/modifications due to this request are now available for other requests

Page 80: Acoustic Databases

LEXICON: Tools

Phoned
• Lexical database plus user interface
• (currently in Access but switching to SQL Server)
• Reuse of G2P and synthesis modules

PhonedAdmin
• Import and export of data from the mother database
• Comparison with the existing mother database
• Definition of users and jobs
• Assignment of jobs to users

Page 81: Acoustic Databases

QUALITY CONTROL

Typical circumstances
• Database project is ongoing
• Often at a remote location
• Multiple persons (for recording and validation)
• Many questions, problems and unclarities arise constantly
• These require answers from specialists
• Danger of errors and inconsistencies
  • Within the work of a single person and between different persons

Constant monitoring
• Systematic and regular quality checks required
• Systematic and regular feedback required
  • During the whole project
  • From the earliest moment possible

Documentation, incl. spot-check report

Page 82: Acoustic Databases

QUALITY CONTROL

Tools

• ADB Scanner: checks consistency of the database
  • Standard structure, all files available
• ADB Statistics
  • Statistics on information items (sex, gender, age, dialect, quality, etc.) and utterance types
• ADB Report Tool
  • For creating parts of the documentation
• And others

Page 83: Acoustic Databases

PRODUCTION

Huge amount of data!

Multiple copies needed

Special fast CD-replicator equipment

Special cupboards for storing the CDs

Description in catalogue

Distribution

Conversion tools (format converter, down-sampling, demultiplexing)

Page 84: Acoustic Databases

DAR Resource Description

Page 85: Acoustic Databases

DAR Resource Description: Statistics

Page 86: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 87: Acoustic Databases

General

More data!

The right data!

High Quality data

In-service Data

ASAP

Page 88: Acoustic Databases

SpeechDat Family

Consortium of industrial and university partners

Often EU projects

One type of database is defined

Each partner makes one database according to spec

Each database is validated by an external organization (SPEX, Nijmegen, the Netherlands)

After approval, the databases are exchanged among the partners

At most 1-1.5 years later, the data are offered for distribution by ELRA

http://www.speechdat.org/

Page 89: Acoustic Databases

Overview of major projects

SpeechDat (M)

SpeechDat-II

SpeechDat-E

SpeechDat-Car

SPEECON

SALA I

SALA II

Page 90: Acoustic Databases

SpeechDat (M)

EU-funded

Production, standardization, evaluation and dissemination of Spoken Language Resources

8 fixed telephone network databases, 1000 speakers each; 1 mobile telephone network database, 300 speakers

Period: 1994-1996

Page 91: Acoustic Databases

SpeechDat (M): Partners

Siemens

Philips

Vocalis

CSELT

UPC

IDIAP

INESC

GEC MSIS

Page 92: Acoustic Databases

SpeechDat (M): Languages

German

French

Danish

Italian

Spanish

Portuguese

Swiss French

Page 93: Acoustic Databases

SpeechDat-II

EU-funded

Creation of Telephony Databases

25 fixed and mobile telephone network databases, 500-5000 speakers each; 3 speaker verification databases

Period: 1996-1998

Page 94: Acoustic Databases

SpeechDat-II: Partners

Aalborg University

Auditex

British Telecom

CSELT

DMI

ELRA

GEC

GPT

IDIAP

INESC

Knowledge S.A.

KTH

Lernout & Hauspie

Matra Nortel

Philips

Portugal Telecom

Siemens

SPEX

Swiss Telecom

Telenor

Univ. of Maribor

Univ. of Munich

Univ. of Patras

UPC

Vocalis

Page 95: Acoustic Databases

SpeechDat-II: Languages

Danish

Flemish

Belgian French

Luxemburg German

Luxemburg French

British English

Welsh

Finnish

Finnish Swedish

French French

Dutch

Swiss French

Swiss German

German

Slovenian

Greek

Italian

Portuguese

Spanish

Swedish

Norwegian

Page 96: Acoustic Databases

SpeechDat-E

EU-funded: Eastern European Speech Databases for Creation of Voice-Driven Teleservices

Speech databases for fixed telephone networks suited for typical present-day teleservices, plus a phonetically rich set of material for vocabulary-independent ASR

1000 – 2500 speakers

Period: 1999-2001

Page 97: Acoustic Databases

SpeechDat-E: Partners

Auditex

Lernout & Hauspie

Philips Speech Processing

Siemens

ELRA

SPEX

Brno University of Technology

Prague Technical University

Budapest University of Technology

Wroclaw University of Technology

Slovak Academy of Sciences

Page 98: Acoustic Databases

SpeechDat-E: Languages

Russian (2500)

Czech

Slovak

Hungarian

Polish

Page 99: Acoustic Databases

SpeechDat-Car

EU-funded

9 in-vehicle and mobile telephone network databases

300 speakers, each in 2 out of 7 conditions (600 recording sessions)

5 simultaneous channels

Period: Apr 1998 - Oct 2000

Page 100: Acoustic Databases

SpeechDat-Car: Partners

Aalborg University

Alcatel

Robert Bosch GmbH

DMI

ELRA

Knowledge S.A.

Lernout & Hauspie

L&H France (formerly Matra Nortel)

Nokia

Renault

SEAT

SPEX

University of Munich

UPC

Vocalis

Volkswagen

Page 101: Acoustic Databases

SpeechDat-Car: Languages

Danish

British English

Finnish

Flemish/Dutch

French

German

Greek

Italian

Spanish

American English

Page 102: Acoustic Databases

SPEECON

Speech-driven interfaces for consumer devices

Speech databases for voice-controlled consumer applications
• television sets, video recorders, mobile phones, palmtop computers, car navigation kits, or even microwave ovens and toasters

600 speakers

Period: 2000-2003

Page 103: Acoustic Databases

SPEECON: Partners

DaimlerChrysler

Ericsson

IBM

Lernout & Hauspie

Natural Speech Communications

Nokia

Philips Speech Processing

Siemens

Sony

Temic Telefunken

Page 104: Acoustic Databases

SPEECON: Languages

EU Spanish

Russian

Italian

Swedish

German

UK English

Danish

Flemish

US English

US Spanish

Hebrew

French

Finnish

Mandarin

Dutch

Japanese

Polish

Portuguese

Swiss German

Cantonese

Page 105: Acoustic Databases

SALA I

SpeechDat Across Latin America

Not government-subsidized

Speech databases for fixed telephony, Latin America

1000-2000 speakers per database

Period: 1998-2001

Page 106: Acoustic Databases

SALA: Partners

CSELT

ELRA

Lernout & Hauspie

Lucent

Philips

Siemens

SPEX

UPC

Vocalis

Page 107: Acoustic Databases

SALA: Languages

Brazil (Portuguese, 2000)

Mexico (2000)

Caribbean islands and Venezuela

Central America

Panama, Colombia

Ecuador, Peru, Bolivia

Chile

Argentina, Uruguay, Paraguay

Page 108: Acoustic Databases

SALA II

Not government-subsidized

To create speech databases for cellular-telephone-oriented applications

America (North and Latin)

1000 (or 2000) speakers

Period: 2001-2002

(project just starting up)

Page 109: Acoustic Databases

SALA II Partners

ATLAS

ELRA

IBM

Lernout&Hauspie

Loquendo

Lucent

NSC

Philips

Siemens

SPEX

UPC 

Page 110: Acoustic Databases

SALA II Languages

Venezuela

Peru

Mexico

Chile

Argentina

Costa Rica

Brazil

Colombia

American English Canada

US English North East

US Spanish East

US English South East

US Spanish West

US English North West

Page 111: Acoustic Databases

Future

Non-native/multilingual ASR

Data for Speech-to-Speech Translation

Access to information
• anytime
• anywhere
• by way of any device

More use of spontaneous speech (“conversational systems”)

Page 112: Acoustic Databases

Future

Devices will become
• increasingly smaller (“mobile”)
• increasingly more powerful
• connected to information sources such as the Internet, etc.
=> robustness against different environments is needed

Input/Output
• Limited
• Keyboards and screens less convenient
• Opportunity for speech input and output
• Other input/output methods get different roles
=> multi-modal input and output systems

Page 113: Acoustic Databases

Future

Distributed systems
• Part of the recognition/synthesis on the local system (“client”)
• Part on the server
• Dynamically adaptable local systems

In the car, speech is a
• “hands-free” and
• “eyes-free” solution

Page 114: Acoustic Databases

Overview

What is a speech database?

How is it used?

What does it contain?

How is it created?

Industrial needs

Technologies and applications

Page 115: Acoustic Databases

Why a Speech User Interface?

Pro
• Audio feedback draws attention
• Complex commands, e.g. control your VCR
• Fast and simple (especially for Chinese!)
  • Speech input: 50-250 wpm
  • Typing: 20-90 wpm
  • Handwriting: 25 wpm
  • Pointing: 10-40/min
• Eyes free
• Hands free
• Mobile
• Compact I/O devices

Con
• Audio messages are difficult to remember if too long, e.g. a telephone number or address
• “A drawing can replace a thousand words”
• Privacy
• Sometimes cumbersome, e.g. controlling a cursor on a screen
• Voice wear-out

Page 116: Acoustic Databases

Text-to-Speech engines

[Chart: voice quality (machine-like → human-like) plotted against processor power & memory (low → high); TTS2500 and TTS3000 sit at the low end, RealSpeak UltraCompact, RealSpeak Compact and RealSpeak at the high end]

Page 117: Acoustic Databases

Text-to-Speech engines

TTS2500
• Low-quality, small-footprint engine for talking dictionary products
• Available, no additional R&D

TTS3000
• Medium-quality engines
• Limited footprint, high densities
• Limited developments

RealSpeak Compact
• Target: handheld devices

RealSpeak
• High-end system

Page 118: Acoustic Databases

RealSpeak TTS

New generation, human-sounding TTS

Target: server-based telephony, PCMM

Platform requirements:
• CPU: 48 real-time instances on a PIII 450 MHz (8 kHz speech data)
• RAM: < 250 kB/instance; ROM: 4-6 MB

Speech bases:
• 8 kHz uncompressed: ~250 MB
• 8, 11 kHz compressed: 20-30 MB
• 22 kHz compressed: 70-90 MB

20 languages: US English, 15 European and 4 Asian languages

2 languages under development

Page 119: Acoustic Databases

RealSpeak Compact

High-quality, medium-footprint TTS

Target: mobile and embedded platforms

Platform requirements:
• 150 MIPS
• RAM: < 250 kB/instance; 4-6 MB common
• ROM: 16 MB (includes an 11 kHz speech base)

Derived automatically from RealSpeak

RealSpeak UltraCompact under development

Page 120: Acoustic Databases

TTS3000

Low-footprint, highly intelligible TTS engine

Target: telephony, PCMM, mobile, embedded

Platform requirements:
• CPU: 20-30 MIPS
• RAM: 100 kB/instance; ROM: 2-3 MB

13 languages including:
• US English
• 7 European languages
• 3 Asian languages

2 languages under development

Page 121: Acoustic Databases

TTS2500

Dedicated TTS for very low-footprint talking dictionaries

Analysis on an 8- or 16-bit processor: < 2 MIPS

Synthesis on a dedicated chip (LH3010 or LH3030) or a DSP (ADSP21xx)

1.5 MB ROM, 16 kB RAM

Languages:
• American English
• Mandarin Chinese
• Mexican Spanish
• German
• French

Page 122: Acoustic Databases

Dimensions of ASR

Speaker
• Independent - adaptive - dependent
• Native - non-native
• Man, woman, child

Recording conditions
• Recording device: telephone, GSM, microphone, tape recorder
• Environment: quiet office, home, car, factory, street, …

Implementation
• Platform: PC, embedded
• CPU and memory

Page 123: Acoustic Databases

Dimensions of ASR

Size of the (active) vocabulary
• Small (10-100) - medium (100-1000) - large (>1000) - very large (>10000)

Flexibility of the vocabulary
• Fixed (factory-definable) - user-definable
• Phoneme-based => speaker-independent
• User words => speaker-dependent

Word sequences
• Isolated words - sentences - word spotting
• Fixed grammar - flexible language model
• Discrete - continuous speech

Language
• Language-independent engine, language-dependent data files
• Swapping language files

Page 124: Acoustic Databases

Different Applications, Different Needs

Dictation
• Speaker-dependent, large vocabulary, continuous speech, quiet office, PC

Command & control, name dialing
• Speaker-independent, small to large vocabulary, noise-robust, DSP boards and/or client-server

Dialogue systems
• Speaker-independent, medium to large vocabulary, noise-robust, client-server

Security: verification
• Speaker-dependent, combination of password (what) + speaker characteristics (who)

Language learning
• Non-native speakers; punish mistakes rather than being tolerant

Page 125: Acoustic Databases

Automatic Speech Recognition

L&H speech recognition engines cover a broad range of tasks, processor types, operating systems and input signal types:

Tasks:
• Large-vocabulary continuous real-time dictation
• Large-vocabulary batch transcription
• Grammar-based recognition: large, medium and small vocabularies
• Small-vocabulary isolated word recognition

Platforms:
• PC
• Server
• Handheld, embedded
• Distributed

Page 126: Acoustic Databases

Automatic Speech Recognition engines

[Chart: task complexity (isolated word recognition → medium-vocabulary closed grammar → large-vocabulary closed grammar → large-vocabulary open-grammar dictation) plotted against processor power & memory (low → high); ASR100 and ASR300 target mobile terminals at the low end, ASR1500/ASR1600 the middle range, and MREC/VX and XCalibur servers the high end]

Page 127: Acoustic Databases

Recognition engines …

Input conditions:
• Environments: home, office, public/industrial, car
• Channels: telephone (wireline, wireless), wideband, mobile devices
• Microphones: close-talking, far-talking
• Combinations: e.g., broadcast material

A wide range of processor/memory operating points:
• 200 MIPS / 32 MB
• 60 MIPS / 1 MB
• 20 MIPS / 300 kB
• 5-10 MIPS / < 30 kB

Page 128: Acoustic Databases

Recognition engines ….

ASR100:
• 5-10 MIPS
• < 30 kB
• Speaker-dependent
• Recording device: mic./phone
• Sampling frequency: 8/11 kHz
• Environment: office
• Vocabulary: small and user-adaptable
• Grammar: isolated
• Speech: isolated
• OS: various
• Architecture: stand-alone
• Languages: language-independent

Applications
• Embedded
• Cell-phone dialing
• Toys

Page 129: Acoustic Databases

Recognition engines ….

ASR300:
• 20 MIPS
• 300 kB
• SI & SD
• Sampling frequency: 8/11 kHz
• Vocabulary: small and factory-adaptable
• Highly noise-robust
• Environment: office/car/other noisy environments
• Unit: word-dependent
• Grammar: isolated
• Speech: quasi-connected command and control
• OS: various
• Architecture: stand-alone
• Languages: US English, French, Italian, Korean, German, Japanese

Applications
• In-car command and control
• Command and control of toys, games
• Command and control in noisy industrial environments

Page 130: Acoustic Databases

Recognition engines ….

ASR1500
• 60 MIPS
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 8 kHz
• Environment: office
• Recording device: telephone / mobile phone
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• IVR applications over the phone
  • Reverse directory, automated attendant
  • Information providers (stock quotes)
  • Ordering systems

• ASR1600: the highly noise-robust variant

Page 131: Acoustic Databases

Recognition engines ….

ASR1600
• 60 MIPS
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 11 kHz
• Environment: office, car; highly noise-robust
• Recording device: mic.
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• In-car recognition
  • Command and control
• Embedded devices
  • PDAs, SmartPhones

Page 132: Acoustic Databases

Recognition engines ….

Mrec/VX:
• > 200 MIPS
• > 64 MB
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office
• Recording device: mic.
• Grammar: statistical
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone
• Languages: US English and Spanish, 7 European languages, 2 Asian languages

Applications
• Document creation, incl. command and control
• MediaIndexer (Mrec)
• Speech transcription (Mrec)

Page 133: Acoustic Databases

Recognition engines ….

Xcalibur
• Scalable
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office (telephony, car)
• Recording device: mic.
• Grammar: statistical and rule-based
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone and client-server
• Languages: currently only Japanese

Applications
• Document creation
• Command and control
• Focus on conversational systems