Asr

Automatic Speech Recognition

By: Khalid El-Darymli (G0327887)

OUTLINE

Automatic Speech Recognition (ASR):
- Definition, capabilities, usage, and milestones.

Structure of an ASR system:
- Speech database.
- MFCC extraction.
- Training & Recognition.

Conclusions.

Multilayer Structure of Speech Production:

- Pragmatic Layer
- Semantic Layer: [book_airplane_flight] [from_locality] [to_locality] [departure_time]
- Syntactic Layer: [I] [would] [like] [to] [book] [a] [flight] [from] [Rome] [to] [London] [tomorrow] [morning]
- Prosodic/Phonetic Layer: [book] -> [b/uh/k]
- Acoustic Layer

What is Speech Recognition?

- The process of converting an acoustic signal, captured by a microphone or telephone, to a set of words.
- Recognized words can be final results, as for applications such as command and control, data entry, and document preparation.
- They can also serve as input to further linguistic processing in order to achieve speech understanding.

Capabilities of ASR include:
- Isolated word recognizers: for segments separated by pauses.
- Word spotting: algorithms that detect occurrences of key words in continuous speech.
- Connected word recognizers: identify uninterrupted, but strictly formatted, sequences of words (e.g. recognition of telephone numbers).
- Restricted speech understanding: systems that handle sentences relevant to a specific task.
- Task-independent continuous speech recognizers: the ultimate goal in this field.

Two types of systems:
- Speaker-dependent: the user must provide samples of his/her speech before using them.
- Speaker-independent: no speaker enrollment necessary.

Uses and Applications
- Dictation: This includes medical transcription, legal and business dictation, as well as general word processing.
- Command and Control: ASR systems that are designed to perform functions and actions on the system.
- Telephony: Some voice mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
- Wearables: Because inputs are limited for wearable devices, speaking is a natural possibility.
- Medical/Disabilities: Many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.
- Embedded Applications: Some newer cellular phones include command-and-control (C&C) speech recognition that allows utterances such as "Call Home".

A Timeline & History of Voice Recognition Software

- 1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder.
- Early 1970s: The HMM approach to speech & voice recognition was invented by Lenny Baum of Princeton University.
- 1971: DARPA established the Speech Understanding Research (SUR) program, funded with $3 million per year of government funds for 5 years. It was the largest speech recognition project ever.
- 1982: Dragon Systems was founded.
- 1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.
- 1995: Dragon released discrete word dictation-level speech recognition software. It was the first time dictation speech & voice recognition technology was available to consumers.
- 1997: Dragon introduced "Naturally Speaking", the first "continuous speech" dictation software available.
- 1998: Microsoft invested $45 million to be able to use speech & voice recognition technology in its systems.
- 2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
- 2003: ScanSoft ships Dragon NaturallySpeaking 7 Medical, which lowers healthcare costs through highly accurate speech recognition. ScanSoft, Inc. is presently the world leader in speech recognition technology in the commercial market.

The Structure of an ASR System:

Functional Scheme of an ASR System

[Figure: block diagram with the blocks Signal Interface, Feature Extraction, Training HMM, Recognition, and Databases; the signals are labeled Speech samples, X, Y, S, and W*.]

Speech Database:
- A speech database is a collection of recorded speech accessible on a computer and supported with the necessary annotations and transcriptions.
- The databases collect the observations required for parameter estimation.
- The corpora have to be large enough to cover the variability of speech.

Transcription of speech:
- It is linguistic information associated with digital recordings of acoustic signals.
- This symbolic representation of speech is used to easily retrieve the content of the databases.
- Transcription involves segmentation and labeling.

Ex.: An acoustic waveform for the sentence "how much allowance".
- Speech data are segmented and labeled with the phoneme string "/h# hh ak m ah tcl cj ax l aw ax n s/".

[Figure: Segmentation and labeling example]

Many databases are distributed by the Linguistic Data Consortium: www.ldc.upenn.edu

Speech Signal Analysis

Feature Extraction for ASR:
- The aim is to extract the voice features to distinguish different phonemes of a language.

MFCC extraction:
- Pre-emphasis: to obtain similar amplitude for all formants.
- Windowing: to enable spectral evaluation the signal has to be stationary; accordingly, windowing makes it possible to perform spectral evaluation on short time periods.
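The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pre-emphasis coefficient 0.97, the 8 kHz sample rate, and the 25 ms frame with a 10 ms hop are common illustrative values, not figures taken from these slides, and the function names are hypothetical.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (L - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames and apply a Hamming window to each."""
    window = hamming(frame_len)
    result = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        result.append([s * w for s, w in zip(frame, window)])
    return result

# Toy input: 0.1 s of a 440 Hz tone sampled at an assumed 8 kHz.
sample_rate = 8000
x = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
emphasized = pre_emphasis(x)
windowed = frames(emphasized, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
print(len(windowed), len(windowed[0]))  # 8 frames of 200 samples
```

Each windowed frame is short enough to be treated as approximately stationary, which is what the later spectral analysis relies on.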

[Figure: MFCC extraction pipeline: speech signal $x(n)$ -> Pre-emphasis -> $x'(n)$ -> Window -> $x_t(n)$ -> DFT -> $X_t(k)$ -> Mel filter banks -> $Y_t(m)$ -> $\log(|\cdot|^2)$ -> IDFT -> MFCC $y_t^{(m)}(k)$]

Spectral Analysis:
- DFT: $X_t(k) = X_t(e^{j 2\pi k / N}), \quad k = 0, \ldots, N-1$
- Filter bank processing: to obtain the spectral features of speech by properly integrating the spectrum over defined frequency ranges.
- A set of 24 band-pass filters (BPF) is generally used since it simulates human ear processing.
- The most widely used scale is the Mel scale.
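A minimal sketch of the spectral analysis stage follows: a direct DFT power spectrum and a bank of 24 triangular filters spaced on the Mel scale. The Mel conversion formula used here (2595 log10(1 + f/700)) is one common convention and is an assumption, as are the triangular filter shape, the 8 kHz sample rate, the 256-point frame, and all function names; the slides only state that 24 band-pass filters on the Mel scale are typical.

```python
import cmath
import math

def dft_power(frame):
    """Power spectrum |X_t(k)|^2 for k = 0..N/2, straight from the DFT definition."""
    n_points = len(frame)
    power = []
    for k in range(n_points // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        power.append(abs(acc) ** 2)
    return power

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_points, sample_rate):
    """Triangular band-pass filters whose edges are equally spaced on the Mel scale."""
    n_bins = n_points // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edges: filter m rises from edge m-1 to edge m, falls to edge m+1.
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges_bin = [f * n_points / sample_rate for f in edges_hz]  # fractional DFT bins
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        weights = []
        for k in range(n_bins):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def filter_bank_energies(power, bank):
    """Integrate the power spectrum under each filter: one energy Y_t(m) per filter."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]

# Toy frame: a 1 kHz tone, which falls exactly on DFT bin 32 for N=256 at 8 kHz.
sample_rate, n_points = 8000, 256
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(n_points)]
power = dft_power(frame)
bank = mel_filter_bank(24, n_points, sample_rate)
energies = filter_bank_energies(power, bank)
print(power.index(max(power)), len(energies))
```

The direct DFT is used only to keep the sketch dependency-free; a real front end would use an FFT.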

Up to here, the procedure has the role of smoothing the spectrum and performing processing similar to that executed by the human ear.

$\log(|\cdot|^2)$:
- Again, this process is performed by the human ear as well.
- Taking the magnitude discards the useless phase information.
- The logarithm performs a dynamic compression, making feature extraction less sensitive to variations in dynamics.
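The final stage, log compression followed by the inverse transform, can be sketched as below. Note one substitution: the slides name an IDFT, while this sketch uses a DCT-II, which is the usual practical choice because the log filter-bank energies are real. The number of coefficients (12), the energy floor, and the function name are assumptions for illustration.

```python
import math

def mfcc_from_energies(energies, n_coeffs=12, floor=1e-10):
    """Log-compress filter-bank energies, then transform them with an unnormalized DCT-II."""
    # The floor avoids log(0) for empty filters (assumed value).
    log_e = [math.log(max(e, floor)) for e in energies]
    n = len(log_e)
    coeffs = []
    for k in range(n_coeffs):
        coeffs.append(sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / n)
                          for m in range(n)))
    return coeffs

# Toy energies for 24 filters (hypothetical values), and the same frame 10x louder.
quiet = [1.0 + 0.5 * m for m in range(24)]
loud = [10.0 * e for e in quiet]
c_quiet = mfcc_from_energies(quiet)
c_loud = mfcc_from_energies(loud)

# A gain change only shifts coefficient 0; the remaining coefficients are unchanged.
print(max(abs(a - b) for a, b in zip(c_quiet[1:], c_loud[1:])) < 1e-9)
```

This illustrates the dynamic compression point made above: scaling the signal adds a constant to every log energy, and that constant lands entirely in the zeroth coefficient.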

Explanatory Example

[Figure: four panels: speech waveform of the phoneme "\ae"; after pre-emphasis and Hamming windowing; power spectrum; MFCC]

Training and Recognition:

Training:
- The model is built from a large number of different correspondences (X', W').
- This is the same training procedure as that of a baby.
- The greater the number of pairs, the greater the recognition accuracy.

Recognition:
- All the possible sequences of words W are tested to find the W* whose acoustic sequence $X = h(W^*, \cdot)$ best matches the one given.

Deterministic vs. Stochastic framework:

Deterministic framework:
- DTW: One or more acoustic templates are memorized per word.
- Drawback: This is not sufficient to represent all the speech variability, i.e. all the X associated with a word.
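The template matching idea behind DTW can be sketched as follows. This is a minimal illustration, assuming one-dimensional features, an absolute-difference local cost, and one stored template per word; the template values and function names are hypothetical.

```python
def dtw_distance(template, observed):
    """Dynamic time warping cost between two 1-D feature sequences."""
    inf = float("inf")
    rows, cols = len(template), len(observed)
    # cost[i][j] = best accumulated cost aligning template[:i] with observed[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(template[i - 1] - observed[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # template frame skipped ahead
                                     cost[i][j - 1],      # template frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]

def recognize(templates, observed):
    """Return the word whose stored template warps onto the observation most cheaply."""
    return min(templates, key=lambda word: dtw_distance(templates[word], observed))

templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 2, 4]}
observed = [1, 1, 3, 3, 4, 4, 3, 1]  # the "yes" pattern spoken more slowly
print(recognize(templates, observed))  # yes
```

The warping absorbs differences in speaking rate, but a single template per word still cannot cover different speakers or pronunciations, which is the drawback noted above.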

Stochastic framework:
- The knowledge is embedded stochastically.
- This allows us to consider a model like the HMM that takes more correspondences (X', W') into account.
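In the stochastic framework, the search for W* is usually written as a maximum a posteriori decision. The formulation below is the standard textbook one rather than notation taken from these slides; here $P(X \mid W)$ plays the role of the acoustic model and $P(W)$ that of the language model.

```latex
W^* = \arg\max_{W} P(W \mid X)
    = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
    = \arg\max_{W} P(X \mid W)\, P(W)
```

The denominator $P(X)$ can be dropped because it does not depend on the candidate word sequence W.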

Implementing HMM for Speech Modeling

Training and Recognition

The recognition procedure may be divided into two distinct stages:
- Building HMM speech models based on the correspondence between the observation sequences Y and the state sequence S (TRAINING).
- Recognizing speech using the stored HMM models and the actual observation Y (RECOGNITION).

[Figure: block diagram with the blocks Feature Extraction, Training HMM, and Recognition; the signals are labeled Speech Sample s, Y, S, and W*.]

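Implementation of HMM:

HMM of a simple grammar: "\sil, NO, YES"

[Figure: HMM state diagram. A Start/Silence state s(0) connects to two word models built from the states S(1) to S(6) and S(7) to S(12): w = YES (phonemes 'YE' and 'S') and w = NO (phonemes 'N' and 'O'). Labeled word transition probabilities:
- $P(w_t = \text{yes} \mid w_{t-1} = \text{\textbackslash sil}) = 0.2$
- $P(w_t = \text{no} \mid w_{t-1} = \text{\textbackslash sil}) = 0.2$
- $P(w_t = \text{\textbackslash sil} \mid w_{t-1} = \text{yes}) = 1$
- $P(w_t = \text{\textbackslash sil} \mid w_{t-1} = \text{no}) = 1$
Other labels: the state transition probability $P(s_t \mid s_{t-1})$, the emission probability $P(Y \mid s_t = s(9))$, and the value 0.6.]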

The Search Algorithm:

Hypothesis tree of the Viterbi search algorithm

[Figure: hypothesis tree over Time=1, Time=2, and Time=3, with nodes labeled s(0), s(1), s(2), s(7), and s(8), and branches annotated with the probabilities 0.1, 0.4, 0.1, 0.025, 0.021, 0.051, 0.041, 0.045, 0.036, and 0.032.]
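The Viterbi search can be sketched on a toy version of the "\sil, NO, YES" grammar. This is an illustration only: each word is collapsed to a single state, the 0.2 transitions out of silence come from the grammar figure, the 0.6 silence self-loop is an assumed reading of the stray 0.6 label, and the word self-loops, the emission probabilities, and the observation symbols are hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) of the most likely state sequence."""
    # best[s] = (probability of the best partial path ending in s, that path)
    best = {s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), [s])
            for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Keep only the best predecessor of each state; discarding the other
            # hypotheses is what keeps the hypothesis tree from growing exponentially.
            prob, path = max(
                ((best[prev][0] * trans_p[prev].get(s, 0.0), best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s].get(obs, 0.0), path + [s])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob

states = ["sil", "yes", "no"]
start_p = {"sil": 1.0}  # every utterance starts in silence
trans_p = {
    "sil": {"sil": 0.6, "yes": 0.2, "no": 0.2},
    "yes": {"yes": 0.5, "sil": 0.5},  # hypothetical self-loop
    "no": {"no": 0.5, "sil": 0.5},    # hypothetical self-loop
}
emit_p = {  # hypothetical discrete emission probabilities
    "sil": {"quiet": 0.8, "y": 0.1, "n": 0.1},
    "yes": {"quiet": 0.1, "y": 0.8, "n": 0.1},
    "no": {"quiet": 0.1, "y": 0.1, "n": 0.8},
}

path, prob = viterbi(["quiet", "y", "y", "quiet"], states, start_p, trans_p, emit_p)
print(path)  # ['sil', 'yes', 'yes', 'sil']
```

In a full system the same recursion runs over the phoneme-level states s(0) to s(12), usually with log probabilities to avoid numerical underflow on long utterances.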

Conclusions:
- Modern speech understanding systems merge interdisciplinary technologies from Signal Processing, Pattern Recognition, Natural Language, and Linguistics into a unified statistical framework.
- Voice-commanded applications are expected to cover many aspects of our future daily life.
- Car computers, telephones, and general appliances are the most likely candidates for this revolution, which may drastically reduce the use of the keyboard.
- Speech recognition is nowadays regarded by market projections as one of the most promising technologies of the future.
- This is easily seen from industrial product sales, which rose from $500 million in 1997 to $38 billion in 2003.
