Speech and Language Modeling
Shaz Husain, Albert Kalim, Kevin Leung, Nathan Liang

Page 1:

Speech and Language Modeling

Shaz Husain, Albert Kalim, Kevin Leung, Nathan Liang

Page 2:

Voice Recognition

The field of Computer Science that deals with designing computer systems that can recognize spoken words.

Voice Recognition implies only that the computer can take dictation, not that it understands what is being said.

Page 3:

Voice Recognition (continued)

A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words.

However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.

Page 4:

Voice Recognition (continued)

Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems.

Recently, great strides have been made in continuous speech systems -- voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

Page 5:

Voice Recognition (continued)

Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful in instances when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset.

Increasingly, however, as the cost decreases and performance improves, speech recognition systems are entering the mainstream and are being used as an alternative to keyboards.

Page 6:

Natural Language Processing

Comprehending human languages falls under a different field of computer science called natural language processing.

Natural Language: human language. English, French, and Mandarin are natural languages. Computer languages, such as FORTRAN and C, are not.

Probably the single most challenging problem in Computer Science is to develop computers that can understand natural languages. So far, the complete solution to this problem has proved elusive, although a great deal of progress has been made.

Page 7:

Proteus Project

At New York University, members of the Proteus Project have been doing Natural Language Processing (NLP) research since the 1960s.

Basic Research: Grammars and Parsers, Translation Models, Domain-Specific Language, Bitext Maps and Alignment, Evaluation Methodologies, Paraphrasing, and Predicate-Argument Structure.

Page 8:

Proteus Project: Grammars and Parsers

Grammars are models of linguistic structure. Parsers are algorithms that infer linguistic structure, given a grammar and a linguistic expression.

Given a grammar, we can design a parser to infer structure from linguistic data. Also, given some parsed data, we can learn a grammar.

Example of a research application: the Apple Pie Parser for English. For example, "I love an apple pie" will be parsed as

(S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-))

Web-based application: http://complingone.georgetown.edu/~linguist/applepie.html
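To make the grammar-plus-parser pipeline concrete, here is a minimal sketch using the NLTK toolkit. The toy grammar below is our own assumption, written just to cover the example sentence; it is not the Apple Pie Parser's actual rule set.

# Toy CFG and chart parser in NLTK (an illustration, not the Apple Pie Parser).
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> PRP | DT NN NN
    VP  -> VBP NP
    PRP -> 'I'
    VBP -> 'love'
    DT  -> 'an'
    NN  -> 'apple' | 'pie'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['I', 'love', 'an', 'apple', 'pie']):
    print(tree)
# (S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))))

Conversely, a collection of parsed sentences like this output is exactly what grammar-learning methods train on.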

Page 9:

Proteus Project: Translation Models

Translation models describe the abstract/mathematical relationship between two or more languages.

Also called models of translational equivalence because the main thing that they aim to predict is whether expressions in different languages have equivalent meanings.

A good translation model is the key to many trans-lingual applications, the most famous of which is machine translation.

Page 10:

Proteus Project: Domain-specific Language

Sentences in different domains of discourse are structurally different.

For example, imperative sentences are common in computer manuals, but not in annual company reports. It would be useful to characterize these differences in a systematic way.

Page 11:

Proteus Project: Bitext Maps and Alignment

A "bitext" consists of two texts that are mutual translations. A bitext map is a description of the correspondence relation between elements of the two halves of a bitext.

Finding such a map is the first step to building translation models. It is also the first step in applications like automatic detection of omissions in translations.

Page 12:

Proteus Project: Evaluation Methodologies

There are many correct ways to say almost anything, and many shades of meaning. This "ambiguity" of natural languages makes the evaluation of NLP systems difficult enough to be a research topic in itself.

The Proteus Project has invented new evaluation methods in two areas of NLP where evaluation is notoriously difficult: translation modeling and word sense disambiguation. An example of a research application is the General Text Matcher (GTM), which measures the similarity between texts.

Simple Applet for GTM: http://nlp.cs.nyu.edu/call_gtm.html
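GTM proper scores a maximum matching of word runs between two texts. As a rough illustration of the idea only, here is a sketch of the degenerate run-length-1 case: an F-measure over bag-of-words overlap (the function name and the simplification are ours):

# Simplified GTM-style similarity: precision/recall of word overlap,
# combined as an F-measure. GTM itself rewards longer matched runs.
from collections import Counter

def overlap_f_measure(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    hits = sum((cand & ref).values())   # clipped word matches
    if hits == 0:
        return 0.0
    precision = hits / sum(cand.values())
    recall = hits / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(overlap_f_measure("the truth and nothing but the truth",
                        "nothing but the whole truth"))  # ~0.67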

Page 13:

Proteus Project: Paraphrasing

A paraphrase relation exists between two phrases which convey the same information. The recognition of paraphrases is an essential part of many natural language applications: if we want to process text reporting fact "X", we need to know all the alternative ways in which "X" can be expressed.

Capturing paraphrases by hand is an almost overwhelming task because they are so common and many are domain specific. 

Therefore, the Proteus Project has begun to develop procedures which learn paraphrases from text. The basic idea is to look for news stories from the same day which report the same event, and then examine the different ways in which the same fact gets reported.

Page 14:

Proteus Project: Predicate-Argument Structure

An analysis of sentences in terms of predicates and arguments. It is a "deeper" level of linguistic analysis than constituent structure or simple dependency structure, in particular one that regularizes over nearly equivalent surface strings.

Page 15:

Language Modeling

A bad language model

Page 16:

Language Modeling (continued)

Page 17:

Language Modeling (continued)

Page 18:

Language Modeling: Introduction

Language modeling:
– is one of the basic tasks in building a speech recognition system
– helps a speech recognizer figure out how likely a word sequence is, independent of the acoustics
– lets the recognizer make the right guess when two different sentences sound the same

Page 19:

Basics of Language Modeling

Language modeling has been studied from two different points of view:
– First, as a problem of grammar inference: the model has to discriminate the sentences which belong to the language from those which do not.
– Second, as a problem of probability estimation: if the model is used for recognition, the decision is usually based on the maximum a posteriori rule. The best sentence L is chosen so that the probability of the sentence, given the observations O, is maximized.
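In symbols, the maximum a posteriori rule can be written as follows (a standard formulation via Bayes' rule; the slide states it only in words):

\hat{L} = \arg\max_{L} P(L \mid O)
        = \arg\max_{L} \frac{P(O \mid L)\, P(L)}{P(O)}
        = \arg\max_{L} P(O \mid L)\, P(L)

where P(O | L) is supplied by the acoustic model and P(L) by the language model.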

Page 20:

What is a Language Model

A language model is a probability distribution over word sequences:
– P("And nothing but the truth") ≈ 0.001
– P("And nuts sing on the roof") ≈ 0

Page 21:

How Language Models Work

Hard to compute directly:
– P("And nothing but the truth")

Decompose the probability:
– P("And nothing but the truth") =
P("And") × P("nothing" | "And") × P("but" | "And nothing") × P("the" | "And nothing but") × P("truth" | "And nothing but the")
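The general form of this decomposition is the chain rule of probability, applied to a word sequence w_1 … w_n:

P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})

No approximation has been made yet; the n-gram models on the following slides approximate each conditional by truncating its history.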

Page 22:

Types of Language Modeling

– Statistical language modeling
– N-gram / trigram language modeling
– Structured language modeling

Page 23:

Statistical Language Model

A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.

Page 24:

The Trigram / N-gram LM

Assume each word depends only on the previous two words (or n−1 words for an n-gram; "tri" means three, "gram" means writing):
– P("the" | "… whole truth and nothing but") ≈ P("the" | "nothing but")
– P("truth" | "… whole truth and nothing but the") ≈ P("truth" | "but the")
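A maximum-likelihood trigram probability is just a ratio of counts, C(uvw)/C(uv). A minimal sketch follows; the toy corpus and function names are ours, and a real system would add smoothing, as later slides discuss:

# Maximum-likelihood trigram estimation from counts.
from collections import defaultdict

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

corpus = "the whole truth and nothing but the truth".split()
for u, v, w in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(u, v, w)] += 1
    bigram_counts[(u, v)] += 1

def p_trigram(w, u, v):
    # P(w | u v) = C(u v w) / C(u v); zero for unseen contexts,
    # which is exactly the sparsity problem smoothing addresses.
    if bigram_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_trigram("the", "nothing", "but"))  # 1.0 on this toy corpus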

Page 25:

Structured Language Models

– Language has structure: noun phrases, verb phrases, etc.
– Use the structure of language to capture long-distance information.
– Promising results, but time-consuming, and language is right-branching.

Page 26:

Evaluation

Perplexity is the geometric average inverse probability of the test data.
– It measures language model difficulty, not acoustic difficulty.
– The lower the perplexity, the closer we are to the true model.
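Written out (the standard definition, not shown on the slide), the perplexity of a model P on test words w_1 … w_N is the inverse probability of the test set, normalized by its length:

\mathrm{PP} = P(w_1 \dots w_N)^{-1/N}
            = \left( \prod_{i=1}^{N} P(w_i \mid w_1 \dots w_{i-1}) \right)^{-1/N}

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.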

Page 27:

Language Modeling Techniques

Smoothing
– addresses the problem of data sparsity: there is rarely enough data to accurately estimate the parameters of a language model
– gives a way to combine less specific, more accurate information with more specific but noisier data
– e.g., deleted interpolation, Katz (Good-Turing) smoothing, and modified Kneser-Ney smoothing

Caching
– a widely used technique based on the observation that recently observed words are likely to occur again; models built from recently observed data can be combined with more general models to improve performance
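Deleted interpolation is the simplest concrete example of mixing less specific with more specific estimates. A hedged sketch: the component models are passed in as functions, and the weights shown are placeholders; in practice the lambdas are tuned on heldout data, as the later "Finding Parameter Values" slide describes.

# Deleted interpolation: mix unigram, bigram, and trigram estimates.
def p_interpolated(w, u, v, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to one
    # The less specific but better-estimated terms back up the sparse trigram.
    return l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)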

Page 28:

LM Techniques (continued)

Skipping models
– use the observation that even words that are not directly adjacent to the target word contain useful information

Sentence-mixture models
– use the observation that there are many different kinds of sentences; by modeling each sentence type separately, performance is improved

Clustering
– words can be grouped into clusters through various automatic techniques; then the probability of a cluster can be predicted instead of the probability of the word
– can be used to make smaller models or better-performing ones

Page 29:

Smoothing: Finding Parameter Values

– Split the data into training, "heldout", and test sets.
– Try lots of different values for λ on the heldout data; pick the best.
– Test on the test data.
– Sometimes tricks like EM (expectation maximization) can be used to find the values.
– Heldout data should have (at least) 100-1000 words per parameter.
– Use enough test data to be statistically significant (1000s of words, perhaps).

Page 30:

Caching: Real Life

– Someone says "I swear to tell the truth"; the system hears "I swerve to smell the soup".
– The cache remembers! The person then says "The whole truth", and, with the cache, the system hears "The whole soup." The errors are locked in.
– Caching works well when users correct as they go; it works poorly, or even hurts, without correction.

Page 31:

Caching

If you say something, you are likely to say it again later. Interpolate the trigram model with a cache.
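As a sketch, "interpolate trigram with cache" can look like the following; the unigram cache, the 0.9/0.1 mix, and the class name are our assumptions for illustration:

# A unigram cache interpolated with a base trigram model:
#   P(w | u v) = lam * P_tri(w | u v) + (1 - lam) * P_cache(w)
from collections import Counter

class CachedLM:
    def __init__(self, p_trigram, lam=0.9):
        self.p_trigram = p_trigram  # base model: function (w, u, v) -> prob
        self.lam = lam
        self.cache = Counter()

    def observe(self, word):
        self.cache[word] += 1       # remember recently seen words

    def prob(self, w, u, v):
        total = sum(self.cache.values())
        p_cache = self.cache[w] / total if total else 0.0
        return self.lam * self.p_trigram(w, u, v) + (1 - self.lam) * p_cache

Note how this also explains the previous slide: once "soup" enters the cache, its probability is boosted everywhere, for better or worse.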

Page 32:

Skipping

P(z | …rstuvwxy) ≈ P(z | vwxy). Why not P(z | v_xy)? A "skipping" n-gram skips the value of the 3-back word.

Example: P(time | show John a good) -> P(time | show ____ a good)

Interpolate the variants:
P(z | …rstuvwxy) ≈ λ P(z | vwxy) + μ P(z | vw_y) + (1 − λ − μ) P(z | v_xy)

Page 33:

Clustering

– CLUSTERING = CLASSES (same thing)
– What is P("Tuesday" | "party on")? Similar to P("Monday" | "party on"). Similar to P("Tuesday" | "celebration on").
– Put words in clusters:
  – WEEKDAY = Sunday, Monday, Tuesday, …
  – EVENT = party, celebration, birthday, …

Page 34:

Predictive Clustering Example

Find P(Tuesday | party on):
– P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY)
– C(party on Tuesday) = 0
– C(party on Wednesday) = 10
– C(arriving on Tuesday) = 10
– C(on Tuesday) = 100

Psmooth(WEEKDAY | party on) is high, and Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY).
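A sketch of the decomposition; the hand-set numbers below merely stand in for the smoothed estimates Psmooth on the slide:

# Predictive clustering:
#   P(word | history) = P(cluster(word) | history)
#                     * P(word | history, cluster(word))
cluster_of = {"Tuesday": "WEEKDAY", "Monday": "WEEKDAY"}

def p_cluster(cluster, history):
    # stand-in for Psmooth(WEEKDAY | party on): high, since weekdays
    # often follow "on" even though C(party on Tuesday) = 0
    return 0.5 if (cluster, history) == ("WEEKDAY", "party on") else 0.01

def p_word_in_cluster(word, history, cluster):
    # stand-in for Psmooth(Tuesday | party on WEEKDAY), which backs
    # off to Psmooth(Tuesday | on WEEKDAY)
    return 0.2

def p_predictive(word, history):
    c = cluster_of[word]
    return p_cluster(c, history) * p_word_in_cluster(word, history, c)

print(p_predictive("Tuesday", "party on"))  # 0.1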

Page 35:

Microsoft Language Modeling Research

Microsoft language modeling research falls into several categories:

Language Model Adaptation. Natural language technology in general and language models in particular are very brittle when moving from one domain to another. Current statistical language models are built from text specific to newspapers and TV/radio broadcasts which has little to do with the everyday use of language by a particular individual. We are investigating means of adapting a general-domain statistical language model to a new domain/user when we have access to limited amounts of sample data from the new domain/user.

Page 36:

Microsoft Language Modeling Research

Can Syntactic Structure Help? Current language models make no use of the syntactic properties of natural language, but rather use very simple statistics such as word co-occurrences. Recent results show that incorporating syntactic constraints in a statistical language model reduces the word error rate on a conventional dictation task by 10%. We are working on finding the best way of "putting language into language models" as well as exploring the new possibilities opened by such structured language models for other tasks such as speech and language understanding.

Page 37:

Microsoft Language Modeling Research

Speech Utterance Classification. A simple first step to more natural user interfaces in interactive voice response systems is automated call routing. Instead of listening to prompts like "If you are trying to reach department X say Yes, otherwise say No" or punching keys on your telephone keypad, one could simply state in a sentence what the problem is, for example "There is a fraudulent transaction on my last statement", and get connected to the right customer service representative. We are developing technology that aims at classifying speech utterances into a limited set of classes, enhancing the role of the traditional language model so that it also assigns a category to a given utterance.

Page 38:

Microsoft Language Modeling Research

Building the best language models we can. In general, the better the language model, the lower the error rate of the speech recognizer. By putting together the best results available on language modeling, we have created a language model that outperforms a standard baseline by 45%, leading to a 10% reduction in error rate for our speech recognizer. The system has the best reported results of any language model.

Page 39:

Microsoft Language Modeling Research

Language modeling for other applications. Speech recognition is not the only use for language models. They are also useful in fields like handwriting recognition, spelling correction, even typing Chinese! Like speech recognition, all of these are areas where the input is ambiguous in some way, and a language model can help us guess the most likely input. We're also working on finding new uses for language models in other areas.

Page 40:

Microsoft Speech Software Development Kit

– Enables developers to create, debug, and deploy speech-enabled ASP.NET Web applications intended for deployment to a Microsoft Speech Server.
– Applications are designed for devices ranging from telephones to Windows Mobile™-based devices and desktop PCs.

Page 41:

Speech Application Language Tags (SALT)

SALT is an XML based API that brings speech interactions to the Web.

SALT is an extension of HTML and other markup languages (cHTML, XHTML, WML) that adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. These tags are designed to be used for both voice-only browsers (for example, a browser accessed over the telephone) and multimodal browsers.

SALT is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavors, or with WML, or with any other SGML-derived markup.

Page 42:

What kind of applications can we build with SALT?

SALT can be used to add speech recognition, speech synthesis, and telephony capabilities to HTML- or XHTML-based applications, making them accessible from telephones or from GUI-based devices such as PCs, tablet PCs, and wireless personal digital assistants (PDAs).

Page 43:

XML (Extensible Markup Language)

XML is a collection of protocols for representing structured data in a text format that makes it straightforward to interchange XML documents on different computer systems.

XML allows new markups. XML contains sets of data structures.

They can be transformed into appropriate output formats using XSL/XSLT.

Page 44:

The main top-level elements

<prompt …>
– For speech synthesis configuration and prompt playing

<listen …>
– For speech recognizer configuration, recognition execution and post-processing, and recording

<dtmf …>
– For configuration and control of DTMF collection

<smex …>
– For general-purpose communication with platform components

Page 45:

The input elements <listen> and <dtmf> also contain grammars and binding controls

<grammar …>
– For specifying input grammar resources

<bind …>
– For processing of recognition results

<record …>
– For recording audio input

Page 46:

Speech Library Example

Page 47:

Speech Library Example

Page 48:

Example

<input name="Date" type="Dates" />
<input name="PersonToMeet" type="text" />
<input name="Duration" type="time" />
…
<prompt …>
  Schedule a meeting
  <value targetElement="Date"/> Date
  <value targetElement="Duration"/> Duration
  <value targetElement="PersonToMeet"/> Person
</prompt>

<listen …>
  <grammar …/>
  <!-- low confidence: start a confirmation prompt and a confirmation listen -->
  <bind test="/@confidence $lt$ 50" targetElement="prompt_confirm" targetMethod="start" />
  <bind test="/@confidence $lt$ 50" targetElement="listen_confirm" targetMethod="start" />
  <!-- high confidence: bind the recognized values into the input fields -->
  <bind test="/@confidence $ge$ 50" targetElement="Date" value="//Meeting/Date" />
  <bind test="/@confidence $ge$ 50" targetElement="Duration" value="//Meeting/Duration" />
  <bind test="/@confidence $ge$ 50" targetElement="PersonToMeet" value="//Meeting/Person" />
  …
</listen>

Page 49:

Example (continued)

<rule name="MeetingProperties">
  <l>
    <ruleref name="Date"/>
    <ruleref name="Duration"/>
    <ruleref name="Time"/>
    <ruleref name="Person"/>
    <ruleref name="Subject"/>
    .. ..
  </l>
  <o> <ruleref name="Meeting"/> </o>
  <output>
    <Calendar:meeting>
      <DateTime>
        <xsl:apply-templates name="DayOfWeek"/>
        <xsl:apply-templates name="Time"/>
        <xsl:apply-templates name="Duration"/>
      </DateTime>
      <PersonToMeet>
        <xsl:apply-templates name="Person"/>
      </PersonToMeet>
    </Calendar:meeting>
  </output>
</rule>

<l propname="DayOfWeek">
  <p valstr="Sun"> Sunday </p>
  <p valstr="Mon"> Monday </p>
  <p valstr="Mon"> first day </p>
  .. .. ..
  <p valstr="Sat"> Saturday </p>
</l>

Voice: monday
Generates an XML element:
<DayOfWeek text="first day">Mon</DayOfWeek>

<l propname="Person">
  <p valstr="Nathan"> CEO </p>
  <p valstr="Nathan"> Nathan </p>
  <p valstr="Nathan"> boss </p>
  <p valstr="Albert"> programmer </p>
  ……
</l>

Voice: CEO
Generates:
<Person text="CEO">Nathan</Person>

Page 50:

XML Result

<calendar:meeting text="…">
  <DateTime text="…">
    <DayOfWeek text="…">Monday</DayOfWeek>
    <Time text="…">2:00</Time>
    <Duration text="…">3600</Duration>
  </DateTime>
  <Person>Nathan</Person>
</calendar:meeting>

Page 51:

How SALT Works

Multimodal
– For multimodal applications, SALT can be added to a visual page to support speech input and/or output. This is a way to speech-enable individual controls for 'push-to-talk' form-filling scenarios, or to add more complex mixed-initiative capabilities if necessary.
– A SALT recognition may be started by a browser event such as pen-down on a textbox, for example, which activates a grammar relevant to the textbox and binds the recognition result into the textbox.

Telephony
– For applications without a visual display, SALT manages the interactional flow of the dialog and the extent of user initiative by using the HTML eventing and scripting model.
– In this way, the full programmatic control of client-side (or server-side) code is available to application authors for the management of prompt playing and grammar activation.

Page 52:

Sample Implementation Architecture

A Web server. This Web server generates Web pages containing HTML, SALT, and embedded script. The script controls the dialog flow for voice-only interactions. For example, the script defines the order for playing the audio prompts to the caller, assuming there are several prompts on a page.

A telephony server. This telephony server connects to the telephone network. The server incorporates a voice browser interpreting the HTML, SALT, and script. The browser can run in a separate process or thread for each caller. Of course, the voice browser interprets only a subset of HTML, since much of HTML refers to GUI elements that are not relevant to a voice browser.

A speech server. This speech server recognizes speech, plays audio prompts, and sends responses back to the user.

The client device. Clients include, for example, a Pocket PC or desktop PC running a version of Internet Explorer capable of interpreting HTML and SALT.

Page 53:

SALT Architecture

Page 54:

Multimodal Interactive Notepad (MiPad)

MiPad's speech input addresses the defects of the handheld, such as the struggle to wrap your hands around a small pen and hit the tiny target known as an on-screen keyboard.

Some of the current limitations of speech recognition (background noise, multiple users, accents, and idioms) can be helped with pen input.

Page 55:

MiPad

Page 56:

What does it do?

MiPad cleverly sidesteps some of the problems of speech technology by letting the user touch the pen to a field on the screen, directing the speech recognition engine to expect certain types of input. The Speech group calls this technology "Tap and Talk." If you're sending an e-mail, and you tap the "To" field with the pen before you speak, the system knows to expect a name. It won't try to translate "Helena Bayer" into "Hello there." The semantic information related to this field is limited, leading to a reduced error rate.

On the other hand, if you're filling in the subject field and using free-text dictation, the engine behind MiPad knows to expect anything. This is where the "Tap and Talk" technology comes in handy again. If the speech recognition engine has translated your spoken "I saw a bear," into the text "I saw a hair," you can use the stylus to tap on the word "hair" and repeat "bear," to correct the input. This focused correction, an evolution of the mouse pointer, is easy and painless compared to having to re-type or repeat the complete sentence.

The "Tap and Talk" interface is always available on your MiPad device. The user can give spontaneous commands by tapping the Command button and talking to the handheld. You might tell your MiPad device, "I want to make an appointment," and the MiPad will obediently bring up an appointment form for you to fill in with speech, pen, or both.

Page 57:

Some Projects on Their Way

Page 58:

Projects for Speech Recognition

Robust techniques for speech recognition in noisy environments (Funded by EPSRC and Bluechip Technologies Ltd, Belfast)

Improved large-vocabulary speech recognition using syllables

Multi-modal techniques for improved speech recognition (e.g., combining audio and visual information)

Page 59:

Projects for Speech Recognition

Decision-tree unified multi-resolution models for speech communication on mobile devices in noisy environments (Funded by EPSRC in collaboration.)

Modeling Voice, Accent and Emotion in Text to Speech Synthesis (Funded by EPSRC, in collaboration)

TCS Programme No 3191 (In collaboration with Bluechip Technologies Ltd, Belfast)

Page 60:

Projects for Language Modeling

Development and Integration of Statistical Speech and Language Models (Funded by EPSRC)

Comparison of Human and Statistical Language Model Performance (Funded by EPSRC)

Improved statistical language modeling through the use of domains

Modeling individual words as a means of increasing the predictive power of a language model

Page 61:

Robust techniques for speech recognition in noisy environments (Funded by EPSRC)

Speech recognition degrades dramatically when a mismatch occurs between training and operating conditions.

Mismatch due to ambient or communications-channel noise.

Focus on robust signal pre-processing. Assume knowledge about the noise or the environment.

Page 62:

Robust techniques for speech recognition in noisy environments (Funded by EPSRC)

– Frequency-band corruption
– Partial-time duration corruption
– Partial feature-stream corruption (some components are more sensitive than others)
– Inaccurate noise-reduction processing
– Combinations of the above

Page 63:

Improved large-vocabulary speech recognition using syllables

Fast bootstrapping of initial phone models for a new language
– Requires less training data

Generating baseforms (phonetic spellings) for phonetic languages
– Requires deep linguistic knowledge

Page 64:

Improved large-vocabulary speech recognition using syllables

Bootstrapping
– An existing acoustic model is used to obtain initial phone models:
  • bootstrapping through alignment of target-language speech
  • bootstrapping through alignment of base-language speech data

Statistical baseform generation
– Based on context-dependent decision trees
  • A tree is built for each letter.

Page 65:

Multi-modal techniques for improved speech recognition (e.g., combining audio and visual information)

– Focus on the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition.
– LVCSR (large-vocabulary continuous speech recognition) has made significant progress, yet only under controlled conditions.
– Recognition of speech utterances with visual cues is limited to small vocabularies, speaker-dependent training, and isolated-word speech.

Page 66:

Decision-tree unified multi-resolution models for speech communication on mobile devices in noisy environments

– Re-configurable multi-resolution decision-tree modeling.
– Prediction of the time-varying spectrum of non-stationary noise sources.
– Developing a unified model for speech integrating features for recognition and synthesis, including speaker adaptation.
– Dynamic multi-resolution models to mitigate the impact of distortion of low-amplitude, short-duration speech.

Page 67:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Neutral emotion

Page 68:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Bored emotion

Page 69:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Angry emotion

Page 70:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Happy emotion

Page 71:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Sad emotion

Page 72:

Modeling Voice, Accent and Emotion in Text to Speech Synthesis

Frightened emotion

Page 73:

Basic Principles of ASR

All ASRs work in two phases:

Training phase
– The system learns reference patterns.

Recognizing phase
– An unknown input pattern is identified by considering the set of references.

Page 74:

Three major modules

Signal-processing front end
– Transforms speech signals into a sequence of feature vectors.

Acoustic modeling
– The recognizer matches the sequence of observations with subword models.

Language modeling
– Recognized words are used to construct a sentence.

Page 75:

Overview

A language model is a conditional distribution on the identity of the i-th word in a sequence, given the identities of all previous words.

A trigram model models language as a second-order Markov process. This assumption is clearly false, but it makes the computationally convenient approximation that a word depends on only the two previous words.
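In symbols, the trigram (second-order Markov) approximation is:

P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-2}\, w_{i-1})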

Page 76:

Speech Recognition

Speech recognition is all about understanding human speech.

The ability to convert speech into a sequence of words or meaning and then into action.

The challenge is how to achieve this in the real world where unknown time varying noise is a factor.

Page 77:

Language Modeling

To be able to provide the probabilities of phrases occurring within a given context.

Improve the performance of speech recognition systems and internet search engines.

Page 78:

References

http://www.research.microsoft.com/~joshuago

http://homepages.inf.ed.ac.uk/s0450736/slm.html

http://www.speech.sri.com/people/stolcke/papers/icassp96/paper.html

http://www.asel.udel.edu/icslp/cdrom/vol1/812/a812.pdf

Page 79:

References

http://www-2.cs.cmu.edu/afs/cs.cmu.edu/user/aberger/www/lm.html

http://www.cs.qub.ac.uk/~J.Ming/Html/Robust.htm

http://www.cs.qub.ac.uk/Research/NLSPOverview.html

http://www.research.ibm.com/people/l/lvsubram/publications/conferences/mmsp99.html

http://dea.brunel.ac.uk/cmsp/Proj_noise2003/obj.htm

Page 80:

References

http://database.syntheticspeech.de/index.html#samples

http://www.research.ibm.com/journal/rd/485/kumar.html

http://murray.newcastle.edu.au/users/staff/speech/home_pages/tutorial_sr.html