
  • 1. Automatic Speech Recognition
    • By:
  • Khalid El-Darymli
  • G0327887


  • Automatic Speech Recognition (ASR):
    • Definition,
    • capabilities,
    • usage,
    • and milestones.
  • Structure of an ASR system:
    • Speech database.
    • MFCC extraction.
    • Training & Recognition.
  • Conclusions.

3. Multilayer Structure of speech production:

  • [book_airplane_flight] [from_locality] [to_locality] [departure_time]
  • [I] [would] [like] [to] [book] [a] [flight] [from] [Rome] [to] [London] [tomorrow] [morning]
  • [book] [b/uh/k]
  • Layers shown in the figure: Pragmatic Layer, Semantic Layer, Syntactic Layer, Prosodic/Phonetic Layer, Acoustic Layer.

4. What is Speech Recognition?

  • The process of converting an acoustic signal, captured by a microphone or telephone, into a set of words.
  • The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation.
  • They can also serve as input to further linguistic processing in order to achieve speech understanding.

5. Capabilities of ASR:

  • Isolated word recognizers: for segments separated by pauses.
  • Word spotting: algorithms that detect occurrences of key words in continuous speech.
  • Connected word recognizers: identify uninterrupted, but strictly formatted, sequences of words (e.g. recognition of telephone numbers).
  • Restricted speech understanding: systems that handle sentences relevant to a specific task.
  • Task-independent continuous speech recognizers: the ultimate goal in this field.
  • Two types of systems:
    • Speaker-dependent: the user must provide samples of his/her speech before using them.
    • Speaker-independent: no speaker enrollment is necessary.

6. Uses and Applications

  • Dictation: This includes medical transcription, legal and business dictation, as well as general word processing.
  • Command and Control: ASR systems that are designed to perform functions and actions on the system.
  • Telephony: Some voice mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
  • Wearables: Because inputs are limited for wearable devices, speaking is a natural possibility.
  • Medical/Disabilities: Many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others. As another example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.
  • Embedded Applications: Some newer cellular phones include command-and-control (C&C) speech recognition that allows utterances such as "Call Home".

7. A Timeline & History of Voice Recognition Software

  • 1936: AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder.
  • Early 1970s: The HMM approach to speech & voice recognition was invented by Lenny Baum of Princeton University.
  • 1971: DARPA established the Speech Understanding Research (SUR) program, with $3 million per year of government funds for 5 years. It was the largest speech recognition project ever.
  • 1982: Dragon Systems was founded.
  • 1984: SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.
  • 1995: Dragon released discrete word dictation-level speech recognition software. It was the first time dictation speech & voice recognition technology was available to consumers.

8. Timeline (continued)

  • 1997: Dragon introduced "Naturally Speaking", the first "continuous speech" dictation software available.
  • 1998: Microsoft invested $45 million to allow Microsoft to use speech & voice recognition technology in their systems.
  • 2000: Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
  • 2003: ScanSoft ships Dragon NaturallySpeaking 7 Medical, "Lowers Healthcare Costs through Highly Accurate Speech Recognition".
  • ScanSoft, Inc. is presently the world leader in the technology of speech recognition in the commercial market.

9. The Structure of ASR System:

  • Functional Scheme of an ASR System

[Figure: block diagram with the blocks Speech samples, Signal Interface, Feature Extraction, Training, HMM, Recognition, and Databases; signal labels X, Y, S, W*]

10. Speech Database:

  • A speech database is a collection of recorded speech accessible on a computer and supported with the necessary annotations and transcriptions.
  • The databases collect the observations required for parameter estimation.
  • The corpus has to be large enough to cover the variability of speech.

11. Transcription of speech:

  • Example:
  • The graph below shows an acoustic waveform for the sentence: "how much allowance".
  • Speech data are segmented and labeled with the phoneme string /h# hh ak m ah tcl cj ax l aw ax n s/.
  • A transcription is linguistic information associated with digital recordings of acoustic signals.
  • This symbolic representation of speech is used to easily retrieve the content of the databases.
  • Transcription involves:
  • - Segmentation and labeling.

[Figure: segmentation and labeling example]

12. Many databases are distributed by the Linguistic Data Consortium

13. Speech Signal Analysis

  • Feature Extraction for ASR:
  • - The aim is to extract the voice features to distinguish different phonemes of a language.

14. MFCC extraction:

  • Pre-emphasis: to obtain similar amplitude for all formants.
  • Windowing: spectral evaluation requires a stationary signal; windowing makes it possible to perform spectral evaluation over short time periods, within which speech can be treated as stationary.
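
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the slides: the pre-emphasis coefficient 0.97, the 25 ms frame length, the 10 ms hop, and the 8 kHz test tone are common textbook choices, and the function names are hypothetical.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is an assumed, commonly used value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames and window each one."""
    window = hamming(frame_len)
    out = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        out.append([signal[start + n] * window[n] for n in range(frame_len)])
    return out

# Toy input: 0.1 s of a 440 Hz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
emphasized = pre_emphasize(tone)
windowed = frames(emphasized, frame_len=200, hop=80)  # 25 ms frames, 10 ms hop
print(len(windowed), len(windowed[0]))
```

Each entry of `windowed` is one short-time segment that can then be passed to the DFT.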

[Figure: MFCC extraction block diagram. Speech signal x(n) → Pre-emphasis → Window → DFT → Mel filter banks → Log(|·|²) → IDFT → MFCC; intermediate signals labeled x_t(n), X_t(k), Y_t(m)]

15. Spectral Analysis:

  • DFT: $X_t(k) = X_t(e^{j 2\pi k/N}), \quad k = 0, \ldots, N-1$
  • Filter bank processing: to obtain the spectral features of speech by integrating the spectrum over defined frequency ranges.
    • A set of 24 band-pass filters (BPFs) is generally used, since it simulates human ear processing.
    • The most widely used scale is the Mel scale.
  • Up to here, the procedure has the role of smoothing the spectrum and performing processing similar to that executed by the human ear.
  • - Log(|·|²):
  • - Again, this processing is performed by the human ear as well.
  • The magnitude discards the useless phase information.
  • The logarithm performs a dynamic compression, making feature extraction less sensitive to variations in dynamics.
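
A rough sketch of the chain described above (DFT, Mel filter bank, log of the power, final transform) is given below. Several things here are assumptions rather than content from the slides: the Hz-to-Mel formula `2595 * log10(1 + f / 700)` is one common definition of the Mel scale, the filters are triangular, 12 coefficients are kept, and a DCT-II is used in place of the IDFT shown in the block diagram, which is a common substitution because the log spectrum is real. All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """|X_t(k)|^2 for k = 0 .. N/2, computed with a direct DFT."""
    n_fft = len(frame)
    spec = []
    for k in range(n_fft // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n in range(n_fft))
        spec.append(abs(acc) ** 2)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular band-pass filters with edges equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]  # fractional DFT bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))    # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=24, n_coeffs=12):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Log of the filter-bank energies; the small constant avoids log(0).
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
             for row in bank]
    # DCT-II of the log energies, keeping coefficients 1 .. n_coeffs.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_e))
            for c in range(1, n_coeffs + 1)]

# Toy frame: a 1 kHz tone, 256 samples at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))
```

The direct DFT is quadratic in the frame length and is used only to keep the sketch dependency-free; a practical implementation would use an FFT.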

16. Speech waveform of a phoneme ae

  • Explanatory Example

[Figure: panels showing the signal after pre-emphasis and Hamming windowing, its power spectrum, and the resulting MFCC]

17. Training and Recognition:

  • Training:
  • - The model is built from a large number of different correspondences (X, W).
  • - This is the same way a baby learns.
  • - The greater the number of pairs, the greater the recognition accuracy.
  • Recognition:
  • - All the possible sequences of words W are tested to find the W* whose acoustic sequence X = h(W*, ) best matches the one given.
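
One common way to make "best matches" precise is the statistical (maximum a posteriori) formulation below. It is offered as an assumption about the intended criterion, not as something stated on the slide; $P(X \mid W)$ stands for an acoustic model and $P(W)$ for a prior over word sequences.

$$
W^* = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\, P(W)
$$

The denominator $P(X)$ can be dropped because it does not depend on $W$.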

18. Deterministic vs. Stochastic framework:

  • Deterministic framework:
  • DTW (dynamic time warping): one or more acoustic templates are memorized per word.
  • Drawback: this is not sufficient to represent all the speech variability, i.e. all the X associated with a word.
  • Stochastic framework:
  • The knowledge is embedded stochastically.
  • This allows us to consider a model such as the HMM, which takes more correspondences (X, W) into account.
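
To illustrate the deterministic approach, here is a minimal DTW template matcher. It is a sketch under simplifying assumptions: each frame is a single number rather than a feature vector such as an MFCC, the local cost is an absolute difference, and the templates and function names are made up for the example.

```python
def dtw_distance(template, observed):
    """Cumulative cost of the best monotonic alignment of two 1-D sequences."""
    inf = float("inf")
    rows, cols = len(template), len(observed)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(template[i - 1] - observed[j - 1])
            # Allowed moves: repeat a template frame, repeat an observed
            # frame, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[rows][cols]

def recognize(templates, observed):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(templates[word], observed))

# Hypothetical one-template-per-word vocabulary.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}
observed = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]  # a slower rendition of "yes"
print(recognize(templates, observed))  # prints "yes"
```

The time warping absorbs the difference in speaking rate, but a single template still cannot capture variation between speakers or recording conditions, which is the drawback noted above.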

19. Implementing HMM for Speech Modeling: Training and Recognition

  • The recognition procedure may be divided into two distinct stages:
  • - Building HMM speech models based on the correspondence between the observation sequences Y and the state sequence (S). (TRAINING)
  • - Recognizing speech using the stored HMM models and the actual observation Y. (RECOGNITION)

[Figure: block diagram with the blocks Speech Samples, Feature Extraction, Training, HMM, and Recognition; signal labels Y, S, W*]

20. Implementation of HMM:

  • HMM of a simple grammar:
  • sil, NO, YES

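  • $P(w_t = \text{yes} \mid w_{t-1} = \text{sil}) = 0.2$
  • $P(w_t = \text{no} \mid w_{t-1} = \text{sil}) = 0.2$
  • $P(w_t = \text{sil} \mid w_{t-1} = \text{yes}) = 1$
  • $P(w_t = \text{sil} \mid w_{t-1} = \text{no}) = 1$

[Figure: HMM state diagram with states s(0) to s(12); s(0) is labeled Silence and Start. The word YES (w = YES) is built from Phoneme YE and Phoneme S, and the word NO (w = NO) from Phoneme N and Phoneme O. Other labels: transition probability P(s_t | s_{t-1}), emission probability P(Y | s_t = s(9)), observation Y, and the value 0.6]

21. The search Algorithm: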

  • Hypothesis tree of the Viterbi search algorithm
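
The Viterbi search can be sketched as follows. The two-state model and all of its probabilities are invented for illustration and are not the values from the slides; the function name and argument layout are likewise hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Keep only the best hypothesis entering state s; the others are pruned.
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return best[-1][last][0], path

# Hypothetical toy model: silence vs. speech, observed as quiet/loud frames.
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.6, "speech": 0.4},
           "speech": {"sil": 0.3, "speech": 0.7}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}

prob, path = viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p)
print(path)  # prints ['sil', 'speech', 'speech']
```

Because only the best-scoring predecessor is kept for each state at each time step, the number of live hypotheses stays bounded by the number of states instead of growing with every branch of the hypothesis tree.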

[Figure: hypothesis tree over Time=1, Time=2, and Time=3, with nodes s(0), s(1), s(2), s(7), s(8) and path scores 0.1, 0.4, 0.1, 0.025, 0.021, 0.051, 0.041, 0.045, 0.036, 0.032]

22. Conclusions:

  • Modern speech understanding systems merge interdisciplinary technologies from Signal Processing, Pattern Recognition, Natural Language, and Linguistics into a unified statistical framework.
  • Voice commanded a