Acoustic Databases
Jan Odijk
ELSNET Summer School, Prague, 2001
Acknowledgements
Some of the slides have been borrowed from or are based on work by
• Bart D’Hoore
• Hugo van Hamme
• Robrecht Comeyne
• Dirk van Compernolle
• Bert van Coile
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Linguistic Resources (LRs)
Linguistic Resources are sets of language data in machine readable form that can be used for developing, improving or evaluating language and speech technologies.
Some language and speech technologies
• Text-To-Speech (TTS)
• Automatic Speech Recognition (ASR)
• Dictation
• Speaker Verification/Recognition
• Spoken Dialogue
• Audio Mining
• Machine Translation
• Intelligent Content Management
• …
Linguistic Resources (LRs): Major Types
Electronic Text Corpora
• Newspapers, magazines, etc.
• Usenet texts, e-mail, correspondence
• Etc.
Lexical Resources
• Monolingual lexicons
• Translation lexicons
• Thesauri
• …
Acoustic Resources
• Annotated speech recordings
• Annotated recordings of other acoustic signals
  • Coughing, throat clearing, breathing, …
  • Door slamming, screeching tires (of a car), …
Types of Linguistic Resources: Acoustic Resources

Acoustic Databases (ADBs)
• Controlled recordings of human speech or other acoustic signals
• Enriched with annotations
• Recorded digitally
• Representative of the targeted application environment and medium
• Balanced for phonemes/phoneme combinations
• Speaker parameters, recording quality, and environment/medium documented
Types of Linguistic Resources: Acoustic Resources

Annotated unstructured recordings
• Broadcast material
• Recorded conversations/monologues/speeches, etc.
• Dictated material
• Enriched with annotations
Types of Linguistic Resources: Acoustic Resources

In-service data
• Recorded sessions of humans interacting with the running application
• Usually collected by logging a customer system
• Enriched with annotations
• Used for tuning models, grammars, etc. to a specific application
Types of Linguistic Resources: Acoustic Resources

Environments
• “Quiet”
  • Studio
  • Quiet office
  • Normal office
• Noisy
  • Public place (street, hotel lobby, station, etc.)
  • Car (engine running at 0 km/h, city, highway)
  • Industrial environment
Types of Linguistic Resources: Acoustic Resources

Media
• HQ close-talk microphone
• Desktop microphones
• Telephone
  • Analog or digital
  • Fixed line or mobile
• Wide-band microphones
• Array microphones
• PC/PDA etc. low-quality microphones
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Acoustic Resources: Use
(for speech synthesis modules in TTS systems)
(as acoustic reference material for pronunciation lexicons)
Mainly for speech recognition
Training and test material for research into new recognition engines and engine features
Training and test material for development of acoustic models
Tuning of acoustic models for specific applications
What is speech recognition?
ASR: Automatic speech recognition
Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.
Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
Speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples.
Speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Feature Extraction
Turning the speech signal into something more manageable
• Do analysis once every 10 ms
• Data compression: 220 bytes => 50 bytes => 4 bytes
Sampling of a signal: transforming into a digital form
Extracting relevant parameters from the signal
• Spectral information, energy, pitch, ...

Eliminate undesirable elements (normalization)
• Noise
• Channel properties
• Speaker properties (gender)
Feature Extraction: Vectors
The signal is chopped into small pieces (frames), typically 30 ms
Spectral analysis of a speech frame produces a vector representing the signal properties.
=> result = stream of vectors
[Figure: a 30 ms speech frame (waveform) mapped to a feature vector such as (10.3, 1.2, -0.9, …, 0.2)]
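The framing step above can be sketched in a few lines of Python. This is an illustrative toy: log-energy and zero-crossing rate stand in for real spectral features such as MFCCs, and the frame/hop sizes assume 8 kHz audio.

```python
import math

def frame_signal(samples, frame_len=240, hop=80):
    """Chop a sampled signal into overlapping frames.
    At 8 kHz, 240 samples = 30 ms frames with an 80-sample (10 ms) hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Toy feature vector per frame: log-energy and zero-crossing rate.
    (Real recognizers use spectral features such as MFCCs.)"""
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return (math.log(energy + 1e-10), zcr)

# One feature vector every 10 ms -> a stream of vectors
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
vectors = [frame_features(f) for f in frame_signal(signal)]
```

Each 30 ms frame is reduced to a handful of numbers, which is exactly the "data compression" step the slide describes.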
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Acoustic Model
Split utterance into basic units, e.g. phonemes
The acoustic model describes the typical spectral shape (or typical vectors) for each unit
For each incoming speech segment, the acoustic model will tell us how well (or how badly) it matches each phoneme
Must cope with pronunciation variability
• Utterances of the same word by the same speaker are never identical
• Differences between speakers
• Identical phonemes sound different in different words

=> statistical techniques: models created from a lot of examples
[Figure: states S1-S13 aligned with the stretched utterance “f-r-ie-n-d-l-y c-o-m-p-u-t-e-r-s”]
Word: series of units specific to the word
Acoustic Model: Units
Phoneme: share units that model the same sound

[Figure: word models for “Stop” (S-T-O-P) and “Start” (S-T-A-R-T) built from shared phoneme-level state sequences]
Acoustic Model: Units
Context dependent phoneme
Example “Stop”: S|,|T  T|S|O  O|T|P  P|O|,  (each phoneme with its left and right context)

Diphone
Example “Stop”: ,S  ST  TO  OP  P,

Other sub-word units: consonant clusters
Example “Stop”: ST  O  P
Acoustic Model: Units
Phonemes
Phonemes in context: spectral properties depend on previous and following phoneme
Diphones
Sub-words: syllables, consonant clusters
Words
Multi words: example: “it is”, “going to”
Combinations of all of the above
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Pattern matching
Acoustic Model: returns a score for each incoming feature vector indicating how well the feature corresponds to the model.
= Local score
Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi)
Search algorithm: looks for the best scoring word or word sequence
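The word-scoring step can be sketched as a small Viterbi search over a left-to-right model; the local scores below are made-up log-domain numbers, not the output of a real acoustic model.

```python
def viterbi(local_scores):
    """Best-scoring state sequence for a stream of frames through a
    left-to-right model (each frame either stays in a state or advances).
    local_scores[t][s]: how well frame t matches state s (log domain)."""
    n_states = len(local_scores[0])
    NEG = float("-inf")
    # must start in the first state
    best = [local_scores[0][s] if s == 0 else NEG for s in range(n_states)]
    backptrs = []
    for t in range(1, len(local_scores)):
        new, ptrs = [], []
        for s in range(n_states):
            cands = [(best[s], s)]                    # stay in state s
            if s > 0:
                cands.append((best[s - 1], s - 1))    # advance from s-1
            score, prev = max(cands)
            new.append(score + local_scores[t][s])
            ptrs.append(prev)
        best = new
        backptrs.append(ptrs)
    path = [n_states - 1]                             # must end in final state
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return best[-1], list(reversed(path))

# 4 frames, 2 states: the best path stays in state 0, then moves to state 1
score, path = viterbi([[0, -9], [0, -1], [-1, 0], [-9, 0]])
```

The returned score is the word's total match against the frame stream; the search algorithm then compares such scores across words or word sequences.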
Elements of a Recognizer
[Diagram: Speech Data → Feature Extraction → Pattern Matching (using the Acoustic Model and Language Model) → Post Processing → display text, or Natural Language Understanding → Meaning → Action]
Language Model
Describes how words are connected to form a sentence
Limit possible word sequences
Reduce number of recognition errors by eliminating unlikely sequences
Increase speed of recognizer => real time implementations
Language Model
Two major types• Grammar based
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please ;
<no>: no | no thanks | no thank you ;
• Statistical
• Probability of single words, 2/3-word sequences
• Derived from frequencies in a large corpus
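A minimal sketch of the statistical variant, estimating bigram probabilities from a toy corpus (real language models use far larger corpora plus smoothing):

```python
from collections import Counter

def bigram_model(corpus_sentences):
    """Estimate bigram probabilities P(w2 | w1) from a (toy) corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])        # every word that serves as history
        bigrams.update(zip(words, words[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

probs = bigram_model(["yes please", "yes", "no thank you", "no thanks"])
# e.g. "yes" occurs twice, once followed by "please", so P(please | yes) = 1/2
```

During recognition these probabilities are combined with the acoustic scores to prefer likely word sequences.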
Active Vocabulary
Lists the words that can be recognized by the acoustic model and that are allowed to occur given the language model
Each word is associated with a phonetic transcription
• Enumerated, and/or
• Generated by a Grapheme-to-Phoneme (G2P) module
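A lexicon lookup with a naive letter-to-sound fallback can be sketched as follows; the entries and rules here are hypothetical and only a crude stand-in for a real G2P module.

```python
def transcribe(word, lexicon, letter_to_sound):
    """Return phonetic transcriptions for a word: enumerate them from the
    lexicon when present, otherwise fall back to per-letter rules
    (a crude stand-in for a real G2P module)."""
    if word in lexicon:
        return lexicon[word]      # a word may have several pronunciations
    return [[letter_to_sound.get(ch, ch) for ch in word]]

# hypothetical entries in SAMPA-like notation (illustrative only)
lexicon = {"yes": [["j", "E", "s"], ["j", "E", "@"]]}
rules = {"n": "n", "o": "@U"}
```

Usage: `transcribe("no", lexicon, rules)` falls through to the letter-to-sound rules because "no" is not in the lexicon.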
Post Processing
Re-ordering of the N-best list using other criteria, e.g. account numbers, telephone numbers
Spelling: name search from a list of known names
Applying NLP techniques that fall outside the scope of the statistical language model
• E.g. “three dollars fifty cents” => “$ 3.50”
• “doctor Jones” => “Dr. Jones”
• Etc.
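The money-amount rewrite can be sketched as a toy pattern rule; a real post-processor covers many more constructions than this single pattern.

```python
NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
          "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
          "ten": 10, "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def normalize_money(text):
    """Rewrite '<X> dollars <Y> cents' as '$ X.YY' (toy pattern only)."""
    words = text.split()
    if len(words) == 4 and words[1] == "dollars" and words[3] == "cents":
        return "$ %d.%02d" % (NUMBER[words[0]], NUMBER[words[2]])
    return text          # anything else passes through unchanged
```

For example, `normalize_money("three dollars fifty cents")` yields the display form "$ 3.50".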
Training of Acoustic Models
[Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Training of Acoustic Models
Database design
• Coverage of units: word, phoneme, context-dependent unit
• Coverage of population (region, dialect, age, …)
• Coverage of environments (car, telephone, office, …)

Database collection and validation
• Checking recording quality
• Annotation: describing what people said, extra-speech sounds

Dictionaries
• Phonetic transcription of words
• Multiple transcriptions needed
• G2P: automatic transcription
[Figure: stream of feature vectors, e.g. (10.3, 1.2, -0.9, …, 0.2), (2.1, -0.2, 1.9, …, -0.3), …, (8.1, -0.5, 1.3, …, 0.2)]
Example: discrete models
A collection of prototypes is constructed (100 to 250)
Each vector is replaced by its nearest prototype
[Figure: scatter plot of feature vectors and the prototypes chosen to represent them]

[Figure: the utterance “friendly computers”: each 10 ms frame’s feature vector is replaced by its nearest prototype index, the frame stream is aligned with the phoneme string f r E n d l I , k O m p j u t $ r z, and prototype-phoneme co-occurrence counts (and percentages) are accumulated in a table]
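The quantization step can be sketched as a nearest-prototype search; the prototype set and vectors below are invented for illustration.

```python
def nearest_prototype(vector, prototypes):
    """Replace a feature vector by the index of its nearest prototype
    (squared Euclidean distance): the quantization step of a discrete model."""
    dists = [sum((v - p) ** 2 for v, p in zip(vector, proto))
             for proto in prototypes]
    return dists.index(min(dists))

protos = [(0.0, 0.0), (10.0, 1.0), (8.0, -0.5)]   # toy prototype set
stream = [(10.3, 1.2), (8.1, -0.5), (0.2, 0.1)]
codes = [nearest_prototype(v, protos) for v in stream]
# each frame is now a single prototype index instead of a full vector
```

A real system would use 100 to 250 prototypes, as the slide states, but the replacement rule is the same.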
Training of Acoustic Models
Create New Models
For all utterances in the database:
• Make a phonetic transcription of the sentence
• Use the models to segment the utterance file: assign a phoneme to each speech frame
• Collect statistical information: count prototype-phoneme occurrences
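The counting step can be sketched as follows, with a made-up frame-level alignment:

```python
from collections import defaultdict

def count_cooccurrences(prototype_stream, phoneme_stream):
    """Accumulate how often each prototype index is seen while a given
    phoneme is being spoken (one entry per 10 ms frame)."""
    counts = defaultdict(lambda: defaultdict(int))
    for proto, phoneme in zip(prototype_stream, phoneme_stream):
        counts[phoneme][proto] += 1
    return counts

# frames of 'friendly' aligned with phonemes (toy alignment)
protos = [2, 2, 7, 6, 9, 9]
phonemes = ["f", "f", "r", "r", "E", "E"]
table = count_cooccurrences(protos, phonemes)
```

Normalizing these counts per phoneme gives the discrete emission probabilities of the new acoustic model.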
Key Element in ASR
ASR is based on learning from observations
• Huge amounts of spoken data needed for making acoustic models
• Huge amounts of text data needed for making language models

=> Lots of statistics, few rules
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Contents of an ADB
Utterances of different utterance types
Utterance types suited to the intended application domain
Text balanced for phoneme and/or diphone distribution
All enriched with annotations
Contents of an ADB: Spontaneous v. Read Utterances

A spontaneous utterance is a response to a question or a request
• “In which city do you live?”
• “Please spell a letter heading to your secretary”
• “Is English your mother tongue?”
• “Make a hotel reservation”

A read utterance is an utterance read from a presented text
• “London”
• “Dear John”
• “Yes”
• “Please book me a room for 2 persons with bath. We will arrive …”
Contents of an ADB
Isolated phonetically rich word: Apple Tree, Lobster
Isolated digit: 5
Isolated alphabet letter: B
Isolated number (natural number): 4256
Contents of an ADB
Continuous digits: 9 1 1
Continuous alphabet: Y M C A
Commands: Stop, left, print, call, next
Contents of an ADB: Connected Digits

Telephone numbers: 057/228888
Credit card numbers: 3741 959289 310001
PIN codes: 8978
Social security number: 560228 561 80
Other identification numbers, e.g. sheet id: 012589225712
Contents of an ADB: Time and Date Expressions

Time (“analog”, word style): a quarter past two
Time (“digital”): 14:15, 2:15 PM
Date (“analog”, word style, absolute): Friday, June 25th, 1999; Christmas Eve, Easter
Date (“digital”, absolute): 25/06/99
Date (“analog”, word style, relative): tomorrow, next week, in one month
Contents of an ADB
Money amounts: $327.67, £148.95
Isolated phonetically rich sentences: A cold supper was ordered and a bottle of port
Isolated command sentences: Insert this name in the list
Names: Microsoft, New York, Jonathan
Syllables: Hi-ta-chi
Contents of an ADB
Continuous phonetically rich sentences: Once upon a time, in a land far from here, lived a little princess. She was the most beautiful girl…
Continuous command sentences: Select the first line. Make it bold and move it to the bottom of the text…
Continuous spontaneous speech: <Make a reservation in a hotel>
Contents of an ADB: Contents of SpeechDat-II ADB

For each speaker/session
• Approx. 40 utterances
• Duration approx. 10 minutes
• Mixture of read and spontaneous utterances
• Mixture of phonetically rich sentences, application-specific words, and utterance types that will often occur in any application

1000-5000 speakers/sessions
Contents of an ADB: Contents of SpeechDat-II ADB
1 isolated digit
4-digit id (sheet number)
3 connected digits (~10-digit telephone number)
12-digit credit card number
3 natural numbers
2 money amounts: 1 large, 1 small
3 spelled words
1 time of day (spontaneous)
Contents of an ADB: Contents of SpeechDat-II ADB
1 time phrase (read, word style)
1 date (spontaneous, e.g. person’s birthday)
2 dates (read, word style)
3 yes/no questions
1 city of call/birth
6 common application words out of 50
3 application word phrases
9 sentences (read)
Contents of an ADB: Contents of SpeechDat-Car ADB

For each session
• Approx. 120-130 utterances (depending on session)
• Duration 2-3 hours
• Mixture of read and spontaneous utterances
• Mixture of phonetically rich sentences, application-specific words, and utterance types that will often occur in any application

600 sessions with min. 300 speakers

In 2 out of 7 conditions
• Standing still / low speed / high speed
• Different road conditions / surrounding noise
• Audio equipment on/off
Contents of an ADB: Contents of SpeechDat-Car ADB

Digits and Digit Strings
• 1 sequence of 10 digits
• 1 sheet number (4+ digit sequence)
• 1 spontaneous telephone number
• 1 credit card number (16 digits)
• 1 PIN code (6 digits)
• 4 isolated digits

Dates
• 1 spontaneous date (e.g. birthday)
• 1 prompted date, word style
• 1 relative and general date expression
Contents of an ADB: Contents of SpeechDat-Car ADB

Names
• 1 spontaneous name (e.g. speaker’s first name)
• 1 city of growing up (spontaneous)
• 2 most frequent cities
• 2 company / agency / street names
• 1 person name (first name or surname)

Spellings
• 1 spontaneous spelled name (e.g. speaker’s first name)
• 1 spelling of a city name
• 4 real words or names
• 1 artificial name (for coverage)
Contents of an ADB: Contents of SpeechDat-Car ADB

Money Amounts / Natural Numbers
• 1 money amount
• 1 natural number

Times
• 1 time of day (spontaneous)
• 1 time phrase (word style)

Phonetically Rich Words
• 4 phonetically rich words
Contents of an ADB: Contents of SpeechDat-Car ADB

Application Words
• 13 mobile phone application words
• 22 IVR function keywords
• 32 car product keywords
• 2 voice activation keywords
• 2 language-dependent keywords

Sentences
• 2 phrases using an application word
• 9 phonetically rich sentences
• 10 prompts for spontaneous speech
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Phases in Acoustic Resource Creation
Design
Creation of Script
Recruitment & Recording
Annotation and Validation
Lexicon
Quality control
Production
DESIGN
Language study:
• Phoneme set
• Dialects
Scripting: utterance definition and distribution over speakers

Speaker typology: distribution definition
• Sex/gender, age, dialect, educational level
Recording: specification of procedure and platform
Validation: specification of procedure and quality standard
Creation of Script: Prompts/Text/Transcription
A prompt refers to the way an utterance is presented to the speaker. This can be done on the desktop, on paper or with a play back file (telephony).
The (presentation) text represents the utterance as it should be pronounced by the speaker. It is normally presented according to the spelling conventions of the target language.
The transcription is the utterance as it has been pronounced by the speaker.
EXAMPLE: the pronunciation of a digit string
• PROMPT: “Please read the number on top of your form”
• TEXT: 578124
• TRANSCRIPTION: five seven eight one two four
Creation of Script
Collect and clean text corpora

Split the cleaned text into a sequence of sentences
• Remove ungrammatical and overly long sentences
• Remove sentences containing offensive language
• Remove (certain) ambiguities in pronunciation: numbers, dates, abbreviations, etc.

Apply phonetic balancing tools to obtain phonetically rich text
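One common way such balancing tools work is greedy set covering; this sketch assumes that approach (the actual tools may differ) and uses letters as stand-in phonemes.

```python
def greedy_balance(sentences, phonemes_of, target):
    """Greedy script selection: repeatedly pick the sentence that covers the
    most not-yet-covered phonemes until the target set is covered."""
    covered, chosen, pool = set(), [], list(sentences)
    while not target <= covered and pool:
        best = max(pool, key=lambda s: len(phonemes_of(s) - covered))
        if not (phonemes_of(best) - covered):
            break                      # no remaining sentence helps
        chosen.append(best)
        covered |= phonemes_of(best)
        pool.remove(best)
    return chosen

# toy example: treat letters as "phonemes"
script = greedy_balance(["ab", "bc", "cd"], set, {"a", "b", "c", "d"})
```

The same greedy idea extends to balancing diphone or triphone counts rather than a plain coverage set.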
Creation of Script
Collect and/or create other utterance types
• Telephone numbers, amounts, credit card numbers, etc.

Create prompts
• Prerecorded messages to the speaker
• For unmonitored recording without access to a screen (telephony)

Put all these in resource files
Creation of Script: Script File

Configuration
• Acquisition board, coding type
• Sampling rate, number of channels

Information items
• Speaker id, sheet id
• Gender, age, region of birth, region of youth, region of living, etc. and their possible values
• Recording environment/conditions

Sentence definitions
• Specifies order and types of utterances in one session
Creation of Script
Resource files => utterance sheets

Generate a letter with instructions and a list of utterances for each speaker (esp. telephony)
Creation of Script: Tools
Script Editor
• For creating/modifying scripts
• For creating utterance sheet files (from resource files)
• For generating letters to speakers
Digit String Generator
• Natural numbers
• Bank accounts
• Credit card numbers
• Phone numbers
• Pin-codes
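A generator along these lines might look like the sketch below. The Luhn checksum used for the credit card numbers is the standard one that real card numbers satisfy; everything else (lengths, interface) is an assumption.

```python
import random

def digit_string(length, seed=None):
    """Generate a random digit utterance (e.g. a PIN code or sheet id)."""
    rng = random.Random(seed)          # seedable for reproducible scripts
    return "".join(str(rng.randrange(10)) for _ in range(length))

def credit_card_number(seed=None):
    """16-digit string whose last digit makes the Luhn checksum valid,
    as real credit card numbers do."""
    body = digit_string(15, seed)
    total = 0
    for i, ch in enumerate(reversed(body)):
        d = int(ch)
        if i % 2 == 0:                 # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return body + str((10 - total % 10) % 10)
```

Generating valid check digits keeps the read material realistic without using anyone's actual card number.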
Creation of Script
Test the script
• By making one or more recording sessions
• Also tests the recording set-up
• Also provides an indication of average session duration
RECRUITMENT
Contact potential speakers according to the typology
• Acquaintances, colleagues
• Advertisements
• Employees/students of cooperating organizations (companies, universities)
• Possibly with the help of marketing agencies
Explain
• Purpose and context
• What the speaker is supposed to do
• How much time it will take
• Reimbursement for the speaker (time spent, travel costs)
Make concrete arrangements with the speakers
Locations: fitting the environment definition
Set-up: recording platform
Interview: log speaker typology & recording conditions
Instructions: what we expect from a speaker
Recording: follow-up on quality
RECORDING
Locations: set up recording equipment in an environment fitting the environment definition

Set up the recording platform and test it
Welcome speaker, instruct speaker
Interview : log speaker typology & recording conditions
Make recordings and follow-up on quality
Deal with administrative matters
• Agreement on ownership of recording
• Reimbursement
RECORDING TOOL
VALIDATION and ANNOTATION
After recording, the signal is NEVER touched
• Only enriched with annotations

Check (and correct) the relation between text & speech
• The orthographic transcription must represent what the speaker said
• Tool to expand abbreviations, numbers, digit sequences

Segmentation
• Check (and correct) begin and end of speech markers
• (Mostly for TTS) Mark begin and end of phonemes
VALIDATION and ANNOTATION
Assign a quality label
• Very good overall quality … very bad overall quality
Annotations for extra events
• Speaker sounds (coughing, breathing, swallowing, …)
• Mispronunciations, truncations
• Sound from other sources (other speaker, music, radio, …)
• Continuous background noise (wind, rain, …)
• Filled pauses (uh, um, er, ah, hmm, ….)
• Telephone distortions
Validation Tool
Semi-Automatic Validation
Validation can be partially automated
• For certain types of databases
• 70-75% reliably validated automatically
• 25-30% require a manual check
• Using ASR systems

Research into further automating this is ongoing
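Such a pipeline could be sketched as a simple routing policy; the exact-match rule and the confidence threshold are assumptions for illustration, not the projects' actual criteria.

```python
def auto_validate(prompt_text, asr_hypothesis, confidence, threshold=0.9):
    """Route one recording: auto-accept when the recognizer's hypothesis
    matches the prompt text with high confidence; otherwise send it to a
    human validator."""
    def norm(s):
        # case- and whitespace-insensitive comparison of transcriptions
        return " ".join(s.lower().split())
    if norm(prompt_text) == norm(asr_hypothesis) and confidence >= threshold:
        return "validated"
    return "manual check"
```

Tightening or loosening the threshold trades off the 70-75% automatic share against the risk of accepting bad recordings.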
LEXICON
One central “mother lexicon” for each language
• To reduce duplication of effort
• To maintain consistency
A request is compared with the mother database
• Unfound entries are imported into the mother database
• Unfound entries are turned into a job
• The job is assigned to linguists
LEXICON
After finishing the job
• Requested entries and properties are exported
• Turned into the required format
• Delivered to the requestor

Additions/modifications due to this request are now available for other requests
LEXICON: Tools
Phoned
• Lexical database plus user interface
• (currently in Access but switching to SQL Server)
• Reuse of G2P and Synthesis Modules
PhonedAdmin
• import and export of data from the mother database
• Comparison with existing mother database
• Definition of users and jobs
• Assignment of jobs to users
QUALITY CONTROL
Typical circumstances
• Database project is ongoing, often at a remote location
• Multiple persons involved (for recording and validation)
• Many questions, problems and unclarities arise constantly, requiring answers from specialists
• Danger of errors and inconsistencies, within the work of a single person and between different persons

Constant monitoring
• Systematic and regular quality checks required
• Systematic and regular feedback required
• During the whole project, from the earliest moment possible
• Documentation, incl. spot-check report
QUALITY CONTROL
Tools
• ADB Scanner: checks consistency of the database
  • Standard structure, all files available
• ADB Statistics
  • Statistics on information items (sex/gender, age, dialect, quality, etc.) and utterance types
• ADB Report Tool
  • For creating parts of the documentation
• And others
PRODUCTION
Huge amount of data!
Multiple copies needed
Special fast CD-replicator equipment
Special cupboards for storing the CDs
Description in catalogue
Distribution
Conversion tools (format converter, down-sampling, demultiplexing)
DAR Resource Description
DAR Resource Description: Statistics
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
General
More data!
The right data!
High Quality data
In-service Data
ASAP
SpeechDat Family
Consortium of industrial and university partners
Often EU projects
One type of database is defined
Each partner makes one database according to spec
Each database is validated by external organization (SPEX, Nijmegen, the Netherlands)
After approval databases are exchanged among the partners
Max. 1-1.5 yr later data are offered for distribution by ELRA
http://www.speechdat.org/
Overview of major projects
SpeechDat (M)
SpeechDat-II
SpeechDat-E
SpeechDat-Car
SPEECON
SALA I
SALA II
SpeechDat (M)
EU-funded
production, standardization, evaluation and dissemination of Spoken Language Resources
8 fixed telephone network databases, 1000 speakers each; 1 mobile telephone network database, 300 speakers
Period: 1994-1996
SpeechDat (M): Partners
Siemens
Philips
Vocalis
CSELT
UPC
IDIAP
INESC
GEC MSIS
SpeechDat (M): Languages
German
French
Danish
Italian
Spanish
Portuguese
Swiss French
SpeechDat-II
EU-funded
Creation of Telephony Databases
25 fixed and mobile telephone network databases, 500-5000 speakers each; 3 speaker verification databases
Period: 1996-1998
SpeechDat-II: Partners
Aalborg University
Auditex
British Telecom
CSELT
DMI
ELRA
GEC
GPT
IDIAP
INESC
Knowledge S.A.
KTH
Lernout & Hauspie
Matra Nortel
Philips
Portugal Telecom
Siemens
SPEX
Swiss Telecom
Telenor
Univ. of Maribor
Univ. of Munich
Univ. of Patras
UPC
Vocalis
SpeechDat-II: Languages
Danish
Flemish
Belgian French
Luxemburg German
Luxemburg French
British English
Welsh
Finnish
Finnish Swedish
French French
Dutch
Swiss French
Swiss German
German
Slovenian
Greek
Italian
Portuguese
Spanish
Swedish
Norwegian
SpeechDat-E
EU-funded: Eastern European Speech Databases for Creation of Voice Driven Teleservices

Speech databases for fixed telephone networks suited for typical present-day teleservices, plus a phonetically rich set of material for vocabulary-independent ASR
1000 – 2500 speakers
Period: 1999-2001
SpeechDat-E: Partners
Auditex
Lernout & Hauspie
Philips Speech Processing
Siemens
ELRA
SPEX
Brno University of Technology
Prague Technical University
Budapest University of Technology
Wroclaw University of Technology
Slovak Academy of Sciences
SpeechDat-E: Languages
Russian (2500)
Czech
Slovak
Hungarian
Polish
SpeechDat-Car
EU-funded
9 in-vehicle and mobile telephone network databases
300 speakers, each in 2 out of 7 conditions (600 recording sessions)
5 simultaneous channels
Period: Apr 1998 - Oct 2000
SpeechDat-Car: Partners
Aalborg University
Alcatel
Robert Bosch GmbH
DMI
ELRA
Knowledge S.A.
Lernout & Hauspie
L&H France (formerly Matra Nortel)
Nokia
Renault
SEAT
SPEX
University of Munich
UPC
Vocalis
Volkswagen
SpeechDat-Car: Languages
Danish
British English
Finnish
Flemish/Dutch
French
German
Greek
Italian
Spanish
American English
SPEECON
Speech driven interfaces for consumer devices
Speech databases for voice-controlled consumer applications
• Television sets, video recorders, mobile phones, palmtop computers, car navigation kits or even microwave ovens and toasters
600 speakers
Period: 2000-2003
SPEECON: Partners
DaimlerChrysler
Ericsson
IBM
Lernout & Hauspie
Natural Speech Communications
Nokia
Philips Speech Processing
Siemens
Sony
Temic Telefunken
SPEECON: Languages
EU Spanish
Russian
Italian
Swedish
German
UK English
Danish
Flemish
US English
US Spanish
Hebrew
French
Finnish
Mandarin
Dutch
Japanese
Polish
Portuguese
Swiss German
Cantonese
SALA I
SpeechDat Across Latin America
Not government-subsidized
Speech databases for fixed telephony, Latin America
1000-2000 speakers per database
Period: 1998-2001
SALA: Partners
CSELT
ELRA
Lernout & Hauspie
Lucent
Philips
Siemens
SPEX
UPC
Vocalis
SALA: Languages
Brazil (Portuguese, 2000)
Mexico (2000)
Caribbean islands and Venezuela
Central America
Panama, Colombia
Ecuador, Peru, Bolivia
Chile
Argentina, Uruguay, Paraguay
SALA II
Not government-subsidized
To create speech databases for cellular-telephone-oriented applications
America (North and Latin)
1000 (or 2000) speakers
Period: 2001-2002
(project just starting up)
SALA II Partners
ATLAS
ELRA
IBM
Lernout&Hauspie
Loquendo
Lucent
NSC
Philips
Siemens
SPEX
UPC
SALA II Languages
Venezuela
Peru
Mexico
Chile
Argentina
Costa Rica
Brazil
Colombia
American English Canada
US English North East
US Spanish East
US English South East
US Spanish West
US English North West
Future
Non-native/multilingual ASR
Data for Speech-to-Speech Translation
Access to information
• Anytime
• Anywhere
• By way of any device
More use of spontaneous speech (“conversational systems”)
Future
Devices will become
• Increasingly smaller (“mobile”)
• Increasingly more powerful
• Connected to information sources such as the Internet, etc.
=> Robustness against different environments needed

Input/Output
• Limited
• Keyboards and screens less convenient
• Opportunity for speech input and output
• Other input/output methods get different roles
=> Multi-modal input and output systems
Future
Distributed systems
• Part of the recognition/synthesis on the local system (“client”)
• Part on the server
• Dynamically adaptable local systems

In the car, speech is a
• “Hands-free” and
• “Eyes-free” solution
Overview
What is a speech database?
How is it used?
What does it contain?
How is it created?
Industrial needs
Technologies and applications
Why a Speech User Interface?

Pro
• Audio feedback draws attention
• Complex commands, e.g. control your VCR
• Fast and simple (Chinese!)
  • Speech input: 50-250 wpm
  • Typing: 20-90 wpm
  • Handwriting: 25 wpm
  • Pointing: 10-40/min
• Eyes free
• Hands free
• Mobile
• Compact I/O devices

Con
• Audio messages are difficult to remember if too long, e.g. a telephone number or address
• “A drawing can replace a thousand words”
• Privacy
• Sometimes cumbersome, e.g. controlling a cursor on a screen
• Voice wear-out
Text-to-Speech engines
[Figure: TTS engines positioned by processor power & memory (low → high) and voice quality (machine-like → human-like): TTS2500, TTS3000, RealSpeak UltraCompact, RealSpeak Compact, RealSpeak]
Text-to-Speech engines
TTS2500
• Low-quality, small-footprint engine for talking dictionary products
• Available, no additional R&D

TTS3000
• Medium-quality engines
• Limited footprint, high densities
• Limited developments

RealSpeak Compact
• Target: handheld devices

RealSpeak
• High-end system
RealSpeak TTS
New generation, human sounding TTS
Target: server based telephony, PCMM
Platform requirements:
• CPU: 48 real-time instances on a PIII 450 MHz (8 kHz speech data)
• RAM: < 250 kB/instance; ROM: 4-6 MB

Speechbases:
• 8 kHz uncompressed: ~250 MB
• 8, 11 kHz compressed: 20-30 MB
• 22 kHz compressed: 70-90 MB
20 languages: US English, 15 European and 4 Asian languages
2 languages under development
RealSpeak Compact
High quality, medium footprint TTS
Target: mobile and embedded platforms
Platform requirements:
• 150 MIPS
• RAM: < 250 kB/instance; 4-6 MB common
• ROM: 16 MB (includes 11 kHz speechbase)
Derived automatically from RealSpeak
RealSpeak ultra compact under development
TTS3000
Low footprint, highly intelligible TTS engine
Target: Telephony, PCMM, Mobile, Embedded
Platform requirements:
• CPU: 20-30 MIPS
• RAM: 100 kB/instance; ROM: 2-3 MB

13 languages including:
• US English
• 7 European languages
• 3 Asian languages
2 languages under development
TTS2500
Dedicated TTS for very low footprint talking dictionaries
Analysis on 8 or 16 bit processor: <2 Mips
Synthesis on a dedicated chip (LH3010 or LH3030) or DSP (ADSP21xx)

1.5 MB ROM, 16 KB RAM

Languages:
• American English
• Mandarin Chinese
• Mexican Spanish
• German
• French
Dimensions of ASR
Speaker
• Independent - adaptive - dependent
• Native - non-native
• Man, woman, child
Recording conditions
• Recording device: telephone, GSM, microphone, tape recorder
• Environment: quiet office, home, car, factory, street…
Implementation
• Platform: PC, embedded
• CPU and memory
Dimensions of ASR
Size of the (active) vocabulary
• Small (10-100) - medium (100-1000) - large (>1000) - very large (>10000)
Flexibility of the vocabulary
• Fixed (factory-definable) - user-definable
• Phoneme-based => speaker-independent
• User words => speaker-dependent
Word sequences
• Isolated words - sentences - word spotting
• Fixed grammar - flexible language model
• Discrete - continuous speech
Language
• Language-independent engine, language-dependent data files
• Swapping language files
Different Applications, Different Needs
Dictation
• Speaker-dependent, large vocabulary, continuous speech, quiet office, PC

Command & control, name dialing
• Speaker-independent, small to large vocabulary, noise robust, DSP boards and/or client-server

Dialogue systems
• Speaker-independent, medium to large vocabulary, noise robust, client-server

Security: verification
• Speaker-dependent, combination of password (what) + speaker characteristics (who)

Language learning
• Non-native speakers, punish mistakes rather than being tolerant
Automatic Speech Recognition
L&H speech recognition engines cover a broad range of tasks, processor types, operating systems and input signal types:
Tasks:
• Large-vocabulary continuous real-time dictation
• Large-vocabulary batch transcription
• Grammar-based recognition: large, medium and small vocabularies
• Small-vocabulary isolated word recognition
Platforms:
• PC
• Server
• Handheld, embedded
• Distributed
Automatic Speech Recognition engines
[Figure: engines positioned by processor power & memory (low → high) and task complexity: ASR100 (isolated word recognition, mobile terminal) → ASR300 → ASR1500/ASR1600 (medium/large vocabulary, closed grammar) → XCalibur/MRec/VX (large vocabulary, open-grammar dictation, server)]
Recognition engines …
Input conditions:
• Environments: home, office, public/industrial, car
• Channels: telephone (wireline, wireless), wideband, mobile devices
• Microphones: close-talking, far-talking
• Combinations: e.g., broadcast material
A wide range of processor/memory operating points:
• 200 Mips / 32 MB
• 60 Mips / 1 MB
• 20 Mips / 300 KB
• 5-10 Mips / < 30 KB
Recognition engines ….
ASR100:
• 5-10 Mips
• < 30 KB
• Speaker-dependent
• Recording device: mic./phone
• Sampling frequency: 8/11 kHz
• Environment: office
• Vocabulary: small and user-adaptable
• Grammar: isolated
• Speech: isolated
• OS: various
• Architecture: stand-alone
• Languages: language-independent

Applications
• Embedded
• Cell-phone dialing
• Toys
Recognition engines ….
ASR300:
• 20 Mips
• 300 KB
• SI & SD
• Sampling frequency: 8/11 kHz
• Vocabulary: small and factory-adaptable
• Highly noise-robust
• Environment: office/car/other noisy environments
• Unit: word-dependent
• Grammar: isolated
• Speech: quasi-connected command and control
• OS: various
• Architecture: stand-alone
• Languages: US English, French, Italian, Korean, German, Japanese

Applications
• In-car command and control
• Command and control of toys, games
• Command and control in noisy industrial environments
Recognition engines ….
ASR1500
• 60 Mips
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 8 kHz
• Environment: office
• Recording device: telephone / mobile phone
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• IVR applications over the phone
  • Reverse directory, automated attendant
  • Information providers (stock quotes)
  • Ordering systems
• ASR1600: the highly noise-robust variant
Recognition engines ….
ASR1600
• 60 Mips
• 1 MB
• SI and speaker-adaptive
• Vocabulary: medium size; user-adaptable
• Sampling frequency: 11 kHz
• Environment: office, car; highly noise-robust
• Recording device: mic.
• Grammar: finite state
• Speech: continuous
• Unit: phoneme
• OS: various
• Architecture: stand-alone and client/server
• Languages: US English, 10 European languages, 4 Asian languages

Applications
• In-car recognition
  • Command and control
• Embedded devices
  • PDAs, SmartPhones
Recognition engines ….
Mrec/VX:
• > 200 Mips
• > 64 MB
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office
• Recording device: mic.
• Grammar: statistical
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone
• Languages: US English and Spanish, 7 European languages, 2 Asian languages

Applications
• Document creation, incl. command and control
• MediaIndexer (Mrec)
• Speech Transcription (Mrec)
Recognition engines ….
Xcalibur
• Scalable
• SI and speaker-adaptive
• Vocabulary: very large (> 64,000 words)
• Sampling frequency: 22 (16) kHz
• Environment: office (telephony, car)
• Recording device: mic.
• Grammar: statistical and rule-based
• Speech: continuous
• Unit: phoneme
• OS: Windows
• Architecture: stand-alone and client-server
• Languages: currently only Japanese

Applications
• Document creation
• Command and control
• Focus on conversational systems